Google Speech to Text API Word Error Rate Analysis Tool

Note: this tool uses the v1p1beta1 Speech to Text API endpoint.

Google Cloud Speech to Text API offers a number of models and options. This program is designed to help determine the API options that provide the lowest Word Error Rate (WER).

Terms and abbreviations

Reference, ground truth, expected, gold standard: a file containing a text transcription produced by humans
Hypothesis: transcript text produced by Speech to Text
STT: Speech to Text
WER: Word Error Rate
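
For orientation, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of words in the reference. The sketch below illustrates that standard definition only. The function name `word_error_rate` is hypothetical, and the tool's own implementation may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, a hypothesis that drops one word from a six-word reference scores 1/6.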

Setup

Create a developer account

If you do not have a Google account, create one
Go to Cloud Speech to Text
Click Try it Free or Get started for Free
Sign in
In Step 1 of 2 select your country, agree to terms, and click Continue
In Step 2 of 2 fill out all the information and supply a payment method (Speech to Text has a free usage tier, so you are only charged if you exceed its limits)

Enable Cloud Speech-to-Text API

Get credentials

After enabling billing above, the API details are displayed. In the upper right corner click Create Credentials
In the step "Find out what kind of credentials you need" pull down menu, select Cloud Speech-to-Text API
Select the "No, I'm not using them" radio button in the question "Are you planning to use this API with App Engine or Compute Engine?"
Click What credentials do I need
In the step "Create a service account", enter any name in the Service account name field (example: speech)
In the Role pull down menu select Project > Owner
The radio button for Key type should default to JSON. Leave it at this default
Click Continue
A dialog will appear stating "Service account and key created" and a JSON file will download

Use of credentials JSON file

Store the file anyplace you wish. (I like to leave it at root level)
Follow the applicable steps in Set the environment variable GOOGLE_APPLICATION_CREDENTIALS
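
Before running the tool, it can help to confirm that the environment variable points at a usable key file. The following is a minimal sketch, not part of this tool. The function name `credentials_path` is hypothetical, and the only check on the file content is the `type` field that service account key files contain.

```python
import json
import os


def credentials_path(environ=os.environ) -> str:
    """Return the key file path from GOOGLE_APPLICATION_CREDENTIALS, or raise."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Key file not found: {path}")
    with open(path, encoding="utf-8") as handle:
        key = json.load(handle)
    # Service account key files carry a "type" field set to "service_account".
    if key.get("type") != "service_account":
        raise ValueError("File does not look like a service account key")
    return path
```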

Using the JSON with a python program

Python Environment

Create a python environment as described in Python Setup. See also:

Files

For best results, follow Best Practices
Single directory containing both audio and reference text files
Audio files must all be of a single supported encoding type
Audio files must all have a single sample rate

File naming

Name the reference files with the same root name as the audio. Reference files must end in .txt

Example:

audio_file_1.mp3
audio_file_1.txt
audio_file_2.mp3
audio_file_2.txt
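
The naming rule can be checked before uploading anything. The sketch below pairs each audio file with the `.txt` file that shares its root name and skips audio that has no reference. The function name `pair_audio_with_references` is hypothetical and this is not the tool's own code.

```python
from pathlib import PurePosixPath


def pair_audio_with_references(names):
    """Map each audio file name to the .txt reference sharing its root name."""
    names = list(names)
    references = {PurePosixPath(n).stem: n for n in names if n.endswith(".txt")}
    pairs = {}
    for name in names:
        if name.endswith(".txt"):
            continue
        stem = PurePosixPath(name).stem
        if stem in references:
            pairs[name] = references[stem]
    return pairs
```

Any audio file missing from the returned mapping has no matching reference file.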

Reference files

A reference file should contain only a human-produced transcription of the audio. This text is also called reference text, expected results text, or ground truth text. There is no need to add punctuation or any kind of markup: punctuation is removed, and other markup might produce anomalous WER results. Common transcription markings in square brackets, such as [cough] or [laugh], are removed automatically. Any other transcription markings that are not enclosed in brackets will cause a miscalculated WER.
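
The kind of cleanup described here can be sketched as follows. This is an illustration under assumptions, not the tool's actual code: the function name `clean_reference` is hypothetical, and lowercasing the text and keeping apostrophes (so contractions survive) are choices made for this example.

```python
import re
import string

# Assumption: keep apostrophes so contractions such as "it's" stay intact.
PUNCTUATION = string.punctuation.replace("'", "")


def clean_reference(text: str) -> str:
    """Drop [bracketed] markings and punctuation, then normalize whitespace."""
    text = re.sub(r"\[[^\]]*\]", " ", text)  # e.g. [cough], [laugh]
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    return " ".join(text.lower().split())
```

Markings written without brackets, for example `(cough)` or `*laugh*`, would survive as ordinary words once the punctuation is stripped, which is why they distort the WER.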

Usage

There are a number of command line options available

python3 run.py -cs "gs://some/bucket" -lr local_result_folder -m video phone call -l es-MX -hz 44200 -p phrase-file.txt -b 10 30 50 -alts es-ES es-DO -n LINEAR16

Required command line parameters

-cs, --cloud_store_uri: Cloud storage uri where audio and ground truth expected reference transcriptions are stored
-lr, --local_results_path: Local path to store generated results
-n, --encoding: Specifies audio encoding type

Optional

Speech to Text features

-m, --models: Space separated list of models to evaluate. Example "video phone"
-l, --langs: Language code. Default is en-US
-hz, --sample_rate_hertz: Specifies the sample rate. Example: 48000
-e, --enhanced: Use the enhanced model(s) if specified by -m
-ch, --multi: Integer indicating the number of channels if more than one
-p, --phrase_file: Path to file containing comma separated phrases for speech adaptation
-b, --boosts: Space separated list of integers for boosts to apply for speech adaptation
-alts, --alternative_languages: Space separated list of language codes for auto language detection. Example: en-IN en-US en-GB
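
These options correspond to fields of the API's `RecognitionConfig`. The sketch below builds a plain dict with the REST (camelCase) field names to show one plausible mapping. The function name `build_recognition_config` and the mapping itself are assumptions made for illustration; the tool may assemble its requests differently.

```python
def build_recognition_config(encoding, language="en-US", sample_rate_hertz=None,
                             model=None, enhanced=False, channels=None,
                             alternative_languages=(), phrases=(), boost=None):
    """Build a RecognitionConfig-shaped dict from command line style options."""
    config = {"encoding": encoding, "languageCode": language}
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    if model is not None:
        config["model"] = model
    if enhanced:
        config["useEnhanced"] = True
    if channels and channels > 1:
        config["audioChannelCount"] = channels
    if alternative_languages:
        config["alternativeLanguageCodes"] = list(alternative_languages)
    if phrases:
        # Speech adaptation: phrases plus an optional boost value.
        context = {"phrases": list(phrases)}
        if boost is not None:
            context["boost"] = boost
        config["speechContexts"] = [context]
    return config
```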

Natural Language Processing

Useful for some use cases where accurate transcription of every single word is not critical. This may include sentiment analysis and labeling applications

-ex, --expand: Expand all contractions. Example: "aren't" = "are not"
-stop, --remove_stop_words: Remove stop words from all text
-stem, --stem: Apply stemming
-nw, --numbers_to_words: Convert numbers to words. Example "order number 123" = "order number one two three"
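
As an illustration of this kind of normalization, the sketch below reproduces the digit-by-digit behavior shown in the `-nw` example. The function name `numbers_to_words` is hypothetical, and the tool's real implementation may handle numbers differently.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def numbers_to_words(text: str) -> str:
    """Replace every digit with its English word, one word per digit."""
    def spell(match):
        return " ".join(DIGIT_WORDS[int(digit)] for digit in match.group())
    return re.sub(r"[0-9]+", spell, text)
```

Applying the same normalization to both reference and hypothesis keeps a difference in number formatting from counting as a word error.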

Other

-to, --transcriptions_only: If specified the only output will be transcripts, no analysis will be done
-a, --alts2prime: Use each alternative language as a primary language. Helpful when audio might be mixed for example en-US en-GB en-AU
-q, --random_queue: Replace default queue.txt with randomly named queue file ####_queue.txt
-fake, --fake_hyp: Use a fake hypothesis for testing. This will allow skipping sending audio to the API
-limit, --limit: Integer limit on the number of audio files to process. Useful when there is a lot of audio and you want some quick results first
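
For readers who want to see how options like these are typically declared, here is a minimal `argparse` sketch covering a subset of the flags listed above. It is an assumption about structure, not a copy of run.py. The helper name `build_parser` and the chosen types and defaults are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="run.py")
    # Required parameters
    parser.add_argument("-cs", "--cloud_store_uri", required=True)
    parser.add_argument("-lr", "--local_results_path", required=True)
    parser.add_argument("-n", "--encoding", required=True)
    # Optional parameters; nargs="+" accepts space separated lists
    parser.add_argument("-m", "--models", nargs="+", default=[])
    parser.add_argument("-l", "--langs", default="en-US")
    parser.add_argument("-hz", "--sample_rate_hertz", type=int)
    parser.add_argument("-b", "--boosts", nargs="+", type=int, default=[])
    parser.add_argument("-alts", "--alternative_languages", nargs="+", default=[])
    parser.add_argument("-e", "--enhanced", action="store_true")
    parser.add_argument("-limit", "--limit", type=int)
    return parser
```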

Speech adaptation

If you use phrases or classes by specifying the command line argument -p, the file must be a text file. It should contain a comma separated list of phrases and/or classes.

some_phrase_file.txt:

pizza, burger, fries, pickles
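
A phrase file in this format can be read with a few lines of Python. The sketch below is illustrative only. The function names `parse_phrases` and `speech_contexts_for_boosts` are hypothetical, and pairing the phrases with each `-b` value as a separate `speechContexts` entry is an assumption about how boosts would be evaluated one at a time.

```python
def parse_phrases(text: str):
    """Split comma separated phrase text into a clean list of phrases."""
    return [part.strip() for part in text.split(",") if part.strip()]


def speech_contexts_for_boosts(phrases, boosts):
    """Return one speechContexts value per boost, one for each run to compare."""
    return [[{"phrases": list(phrases), "boost": boost}] for boost in boosts]
```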

Reports and diagnostics

Detailed report

A results.csv file will be written to the results directory you specified on the command line. The file contains the results of transcribing audio for each of the API options you specified

Diagnostic HTML file

An HTML file will be written for each transcription, using a similar naming pattern. The HTML file highlights differences between the reference text and the API result.
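
A comparable view can be produced with Python's standard `difflib` module. The sketch below is a stand-in to show the idea and is not the tool's own report generator. The function name `diff_html` is hypothetical.

```python
import difflib


def diff_html(reference: str, hypothesis: str) -> str:
    """Return a standalone HTML page comparing the two texts word by word."""
    return difflib.HtmlDiff().make_file(
        reference.split(),   # one word per row so changes are easy to spot
        hypothesis.split(),
        fromdesc="Reference",
        todesc="Hypothesis",
    )
```

Writing the returned string to a `.html` file and opening it in a browser shows the two word lists side by side with changed rows highlighted.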


bbookman/Google-Speech-to-Text-API-Word-Error-Rate-Analysis-Tool
