Skip to content

This repository describes how to build an automatic speech recognition (ASR) model using SageMaker and DeepSpeech Library.

License

Notifications You must be signed in to change notification settings

aws-samples/amazon-sagemaker-automatic-speech-recognition

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Automatic Speech Recognition using DeepSpeech on Amazon SageMaker

Train your own Automatic Speech Recognition (ASR) model and deploy it anywhere

Speech Recognition is the task of translating and converting spoken language into text. Audio data is virtually very difficult for computer systems to search and analyze. Therefore, recorded speech needs to be converted to text before it can be used in various different applications. Automatic Speech Recognition is the task of using deep supervised learning techniques to automatically and accurately convert speech into text. This code is an example of how to use DeepSpeech library to prepare, build, train and host a model using Amazon SageMaker.

DeepSpeech is an open source Speech-To-Text engine based on Baidu’s Deep Speech research paper which implements the DeepSpeech architecture in Tensorflow.

Data Preparation

In this example, we will use the Common Voice Dataset format to train an Arabic ASR model but the setup applies to any other language in the common voice dataset format. We will need to run a few steps to prepare the data, alphabet file, language model, and the scorer. Let's start with the data format:

Note: In this example, we use the Arabic dataset but you can replace the Arabic dataset with any other language and it will work as long as you follow the necessary preparation steps.

Data format:

CV dataset has three columns as below:

wav_filename wave_filesize transcript
common_voice_ar_22759417.wav 132908 سيسعدني مساعدتك أي وقت تحب
common_voice_ar_23675091.wav 101420 إنك تكبر المشكلة
common_voice_ar_23558552.wav 222764 ليست هناك مسافة على هذه الأرض أبعد من يوم أمس

Building the Language Model:

A language model is used to predict what words are more likely to follow each other in a sequence. To build the language model, I am using a python script provided with the Deep Speech library called generate_lm.py available HERE The script takes in a large corpora of text data input.txt.gz and does a few transformation steps into it:

  • Converts words to lowercase
  • Counts word occurences and saves the top-k most common words to a file.
  • Use KenLM Binary files to create the language model binary files

To create the Arabic language model, I run the following command on a large corpora of text:

python3 generate_lm.py --input_txt input.txt.gz --output_dir . --top_k 500000 --kenlm_bins native_client/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type triediscount_fallback

The output of this process is two files: lm.binary and vocab-<top_k>.txt files

Generate the scorer package:

The scorer package is a language model that is used to direct the beam search that happens during the decoding process to generate the output characters and words. The generate_scorer_package binary is available in the native_client package. To install the package, run the command below in the terminal:

sh-4.2$ mkdir native_clients; cd native_clients ; wget -c https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.amd64.cuda.linux.tar.xz && tar -Jxvf native_client.amd64.cuda.linux.tar.xz 

Once the files are downloaded and unpacked, you can generate the scorer package from the files generated at the previous step:

sh-4.2$ generate_scorer_package --alphabet ./alphabet.txt --lm lm.binary --vocab vocab-500000.txt --package kenlm.scorer --default_alpha 0.6560092006459668 --default_beta 2.3034529727156823

The default_alpha and default_beta values are used to assign initial wieghts to a sequence of words. The can be tuned and optimized after training a model.

Start Training

Refer to the Notebook for training and preparing the SageMaker container.

Roadmap

  • Build local Inference. (Completed)
  • Build a SageMaker Inference (WIP)
  • Include steps to optimize inference requests (tuning lm_alpha and lm_beta values involved in creating scorer) (WIP)
  • Multi-instance GPU training (WIP)
  • Support fine-tuning.
  • Support transfer learning.
  • Add steps to create alphabet, vocab files for any language.

Security

See CONTRIBUTING for more information.

License

This code is licensed under the MIT-0 License. See the LICENSE file.

About

This repository describes how to build an automatic speech recognition (ASR) model using SageMaker and DeepSpeech Library.

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages