
Speech Transcription with Whisper v1.0

Date completed November 14, 2023
Release where first appeared OpenWillis v1.6
Researcher / Developer Vijay Yadav, Anzar Abbas

1 – Use

import openwillis as ow

transcript_json, transcript_text = ow.speech_transcription_whisper(filepath = '', model = '', hf_token = '', language = '', min_speakers = '', max_speakers = '', context = '')

2 – Methods

This function transcribes speech into text using WhisperX, an open source library built atop Whisper, a speech transcription model developed by OpenAI. The function allows for offline speech transcription where computational resources are available. It is best suited for researchers transcribing speech on institutional or cloud-based machines, ideally with GPU access.

The function supports five models: tiny, small, medium, large, and large-v2.

  • The tiny model, with 39M parameters, is suitable for personal machines and can handle relatively short audio files.
  • The small model, with 244M parameters, offers a balance between transcription accuracy and the length of audio it can handle.
  • The medium model, with 769M parameters, is a step up for processing longer audio files.
  • The large and large-v2 models, with 1550M parameters, require GPU resources to process long recordings efficiently.

Naturally, transcription accuracy will be higher with the larger models. Refer to the WhisperX instruction set for further information on the computational resources required.
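As a rough illustration of this guidance, the model could be chosen programmatically based on whether a GPU is available; the heuristic below is an assumption for demonstration, not part of OpenWillis itself.

import torch

# Illustrative heuristic only: prefer a larger model when a GPU is
# available, otherwise fall back to the lightweight tiny model.
model_name = 'large-v2' if torch.cuda.is_available() else 'tiny'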

The user will also need a Hugging Face token to access the underlying models. To acquire a token, they will need to create an account on Hugging Face, accept user conditions for the segmentation, voice activity detection, and diarization models, and go to account settings to access their token.
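To avoid hard-coding the token in scripts, one option is to read it from an environment variable; the variable name HF_TOKEN below is just an example.

import os

# Read a previously exported Hugging Face token, e.g.
#   export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
hf_token = os.environ.get('HF_TOKEN')
if hf_token is None:
    raise RuntimeError('Set the HF_TOKEN environment variable before transcribing.')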

Whisper can support several languages/dialects, namely Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh. The language codes for each can be found below.

If no language is specified, Whisper will automatically detect the language and process it as such. However, language classification accuracy may vary depending on the language and its similarity to other languages. We recommend always specifying the language, if known.
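For example, a recording known to be in Spanish could be transcribed with its language code from the table in section 3.4. The file path below is hypothetical, and the sketch assumes the remaining optional parameters can simply be left out.

import openwillis as ow

# Explicitly specifying the language ('es' = Spanish) skips Whisper's
# automatic language-detection step.
transcript_json, transcript_text = ow.speech_transcription_whisper(
    filepath='entrevista_es.wav',   # hypothetical Spanish-language recording
    model='small',
    hf_token=hf_token,              # token obtained as in the snippet above
    language='es'
)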

2.1 – Transcription only

For a straightforward transcription, the function simply passes the source audio through Whisper.

In transcript_json, a word-by-word and phrase-by-phrase transcript is saved as a JSON.

In transcript_text, a string of the entire transcript, including punctuation, is saved.
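As a rough sketch of how these outputs might be consumed — assuming transcript_json follows WhisperX's usual layout of phrase-level segments containing word entries — the transcript can be iterated like this:

# Assumes transcript_json follows WhisperX's segment/word layout;
# adjust the keys if the actual structure differs.
for segment in transcript_json.get('segments', []):
    print(f"[{segment.get('start')}-{segment.get('end')}] {segment.get('text', '').strip()}")
    for word in segment.get('words', []):
        print(f"    {word.get('word')} ({word.get('start')}-{word.get('end')})")

print(transcript_text)   # full transcript as a single string, including punctuation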

2.2 – Transcription + Speaker labeling

When the source audio contains multiple speakers, the function will automatically label each speaker in the output JSON as speaker0, speaker1, speaker2, and so on at both the word and the phrase level.

If the number of speakers in the source audio is known, the user may specify max_speakers = n to limit the number of speakers that will be labeled. Note that this is only an upper limit: if the model detects fewer speakers than the specified maximum, the output JSON will reflect the smaller number.
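A hedged usage example for a recording known to contain two speakers (the file path is hypothetical and the optional parameters are assumed to be omissible):

import openwillis as ow

# Two-person interview: cap diarization at two speaker labels.
transcript_json, transcript_text = ow.speech_transcription_whisper(
    filepath='two_speaker_interview.wav',   # hypothetical file
    model='medium',
    hf_token=hf_token,                      # token obtained as described above
    language='en',
    max_speakers=2
)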

As is to be expected, speaker labeling is not 100% accurate.

If a phrase is labeled as one speaker but contains words assigned to a different speaker, those words’ speaker labels are reassigned to match the phrase-level label.
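The reassignment rule can be illustrated with a small sketch. This is a paraphrase of the behavior described above, not the library’s actual implementation, and the dictionary layout is assumed.

# Sketch of the described rule, assuming a segment dict with a phrase-level
# 'speaker' key and word entries that each carry their own 'speaker' key.
def harmonize_word_speakers(segment):
    phrase_speaker = segment.get('speaker')
    for word in segment.get('words', []):
        if word.get('speaker') != phrase_speaker:
            # A word-level label that disagrees with the phrase-level label
            # is overwritten with the phrase's speaker.
            word['speaker'] = phrase_speaker
    return segment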

2.3 – Transcription + Speaker labeling + Speaker identification

If the source audio is a recording of a structured clinical interview, the user may specify the scale being administered using the context parameter and the function will automatically identify which of the speakers is the clinician and which of the speakers is the participant.

Then, at the word and phrase level, the speaker label is changed to either clinician or participant.

Note: When context is specified, the function assumes that the source audio contains two speakers and automatically sets max_speakers = 2; the user does not need to set it separately.

Currently supported contexts are two clinical scales:

  • MADRS, conducted in accordance with SIGMA, for which context = 'madrs'.
  • PANSS, conducted in accordance with the SCI-PANSS, for which context = 'panss'.
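For example, transcribing a MADRS interview with speaker identification could look like the sketch below; the file path is hypothetical and the optional parameters are assumed to be omissible.

import openwillis as ow

# MADRS interview: specifying the context triggers clinician/participant
# identification, so max_speakers does not need to be set.
transcript_json, transcript_text = ow.speech_transcription_whisper(
    filepath='madrs_interview.wav',   # hypothetical file
    model='large-v2',
    hf_token=hf_token,                # token obtained as described above
    language='en',
    context='madrs'
)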

Speaker identification is done by comparing the transcribed text with the expected rater prompts from the clinical scale. This comparison uses a pre-trained multilingual sentence transformer model, which maps sentences into a shared embedding space based on their underlying meaning. The embeddings are then compared using cosine similarity: the higher the similarity between two embeddings, the closer the corresponding sentences are in meaning. The speaker whose speech more closely matches the expected rater prompts is labeled as the clinician, while the other speaker is labeled as the participant. Because the embedding model is multilingual, the comparison is agnostic to the language of the speech.
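The comparison can be sketched with the sentence-transformers package. The checkpoint name below is an assumption for illustration, not necessarily the model OpenWillis uses, and the prompt list would come from the relevant scale (e.g., SIGMA or SCI-PANSS).

from sentence_transformers import SentenceTransformer, util

# Assumed multilingual checkpoint; OpenWillis may use a different model.
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def clinician_score(speaker_sentences, rater_prompts):
    # Mean cosine similarity between one speaker's utterances and the
    # expected rater prompts for the scale.
    speaker_emb = model.encode(speaker_sentences, convert_to_tensor=True)
    prompt_emb = model.encode(rater_prompts, convert_to_tensor=True)
    return util.cos_sim(speaker_emb, prompt_emb).mean().item()

# The speaker with the higher score is labeled clinician; the other, participant.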


3 – Inputs

3.1 – filepath

Type String
Description A local path to the file that the user wants to process. Most audio and video file types are supported.

3.2 – model

Type String; default is tiny
Description Name of model the user wants to utilize, with the options being tiny, small, medium, large and large-v2.

3.3 – hf_token

Type String
Description The user’s Hugging Face token as described in the function methods

3.4 – language

Type String; optional – if left empty, the language is auto-detected
Description Language of the source audio file, specified using one of the codes below
english en
chinese zh
german de
spanish es
russian ru
korean ko
french fr
japanese ja
portuguese pt
turkish tr
polish pl
catalan ca
dutch nl
arabic ar
swedish sv
italian it
indonesian id
hindi hi
finnish fi
vietnamese vi
hebrew he
ukrainian uk
greek el
malay ms
czech cs
romanian ro
danish da
hungarian hu
tamil ta
norwegian no
thai th
urdu ur
croatian hr
bulgarian bg
lithuanian lt
latin la
maori mi
malayalam ml
welsh cy
slovak sk
telugu te
persian fa
latvian lv
bengali bn
serbian sr
azerbaijani az
slovenian sl
kannada kn
estonian et
macedonian mk
breton br
basque eu
icelandic is
armenian hy
nepali ne
mongolian mn
bosnian bs
kazakh kk
albanian sq
swahili sw
galician gl
marathi mr
punjabi pa
sinhala si
khmer km
shona sn
yoruba yo
somali so
afrikaans af
occitan oc
georgian ka
belarusian be
tajik tg
sindhi sd
gujarati gu
amharic am
yiddish yi
lao lo
uzbek uz
faroese fo
haitian creole ht
pashto ps
turkmen tk
nynorsk nn
maltese mt
sanskrit sa
luxembourgish lb
myanmar my
tibetan bo
tagalog tl
malagasy mg
assamese as
tatar tt
hawaiian haw
lingala ln
hausa ha
bashkir ba
javanese jw
sundanese su

3.5 – max_speakers

Type Integer; optional
Description This parameter sets the maximum number of speakers to be identified and labeled in audio or video transcriptions.

3.6 – context

Type String; optional
Description If the source audio is a recording of a known clinical scale, this specifies which scale was administered. Current options are madrs and panss. If a scale is provided, max_speakers is assumed to be 2.

4 – Outputs

4.1 – transcript_json

Type JSON
Description This is a word-wise and phrase-wise transcript saved as a JSON

4.2 – transcript_text

Type String
Description The transcription, compiled into a string
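A brief example of persisting both outputs to disk (the filenames are arbitrary):

import json

# Save the JSON transcript and the plain-text transcript for later analysis.
with open('transcript.json', 'w', encoding='utf-8') as f:
    json.dump(transcript_json, f, ensure_ascii=False, indent=2)

with open('transcript.txt', 'w', encoding='utf-8') as f:
    f.write(transcript_text)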

5 – Dependencies

Below are dependencies specific to this function.

Dependency License Justification
WhisperX BSD-4 Library for transcribing and diarizing speakers in recordings.
Whisper MIT Library for transcribing speech
Pyannote MIT Library for speaker diarization
SentenceTransformer Apache 2.0 A pre-trained BERT-based sentence transformer model used to compute the similarity between transcribed speech and expected rater prompts.