Speech transcription cloud v1.0
Date completed | July 13, 2023 |
Release where first appeared | OpenWillis v1.2 |
Researcher / Developer | Vijay Yadav |
import openwillis as ow

json, transcript = ow.speech_transcription_cloud(
    filepath='s3://bucket/file.wav',
    model='aws',
    language='en-US',
    region='us-east-1',
    job_name='test',
    ShowSpeakerLabels=True,
    MaxSpeakerLabels=2,
    access_key='xxxx',
    secret_key='xxxx',
    c_scale='panss',
)
This function uses Amazon Transcribe to transcribe speech into text. Future versions will give the user the flexibility to use other cloud-based transcription tools, such as those offered by Azure or Google Cloud Platform.
Compared to the locally running speech_transcription function, the cloud-based version supports more languages, handles multiple speakers, and is arguably more accurate. However, the user is expected to have AWS configured on their system and to input the access and secret keys that allow the function to use their AWS account. To learn how to set up AWS, check out their documentation.
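As a quick sanity check before calling the function, a sketch like the one below can confirm that credentials are visible to the AWS SDK. It uses boto3 directly and is not part of OpenWillis; the region value is only an example.

import boto3

# Sanity check: confirm the AWS SDK can find credentials on this machine.
# This is independent of OpenWillis and purely illustrative.
session = boto3.Session(region_name='us-east-1')
credentials = session.get_credentials()
if credentials is None:
    print('No AWS credentials found; run `aws configure` or pass access_key/secret_key explicitly.')
else:
    print('AWS credentials found for region:', session.region_name)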
The function has two outputs. The json output contains start times, end times, transcription confidence scores, and (in case ShowSpeakerLabels is true) speaker labels. The json also serves as an input to both the speaker_separation and speech_characteristics functions. The transcript output is simply a string of the entire transcription.
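For orientation, below is a hypothetical shape for a single word entry in the json output, based on the fields just listed. The key names are assumptions for illustration and may differ from the exact schema OpenWillis emits.

# Hypothetical word entry in the json output; key names are assumptions
# based on the fields described above, not the exact OpenWillis schema.
example_word = {
    'word': 'hello',
    'confidence': 0.97,
    'start_time': 12.34,
    'end_time': 12.61,
    'speaker_label': 'speaker0',
}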
In the case of two speakers, each word in the output json is by default labeled either speaker0 or speaker1; this labeling is done automatically by AWS. Using these labels, the user can process only the speech of a specific speaker with the speech_characteristics function, which allows the input json to be filtered by speaker label before further processing (see the sketch below).
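A minimal sketch of that kind of filtering, reusing the hypothetical entry shape shown earlier; the speaker_label key is an assumption, not the confirmed OpenWillis schema.

# Minimal sketch: keep only the word entries attributed to one speaker.
# Assumes each entry carries a 'speaker_label' key as illustrated above.
def filter_by_speaker(words, label='speaker0'):
    return [w for w in words if w.get('speaker_label') == label]

speaker0_words = filter_by_speaker([example_word], label='speaker0')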
In case the audio file is of a structured clinical interview such as the PANSS, the user can input the name of the scale in the c_scale argument, and the function will use what it knows about the PANSS to automatically apply clinician and participant labels instead of speaker0 and speaker1.
To decide which speaker is which, the function employs a semantic-comparison technique described here. It compares the transcribed text of each speaker against the expected rater prompts (currently implemented for PANSS prompts only), computes probability scores for the five prompts deemed most likely to be present, and averages those scores. The speaker whose transcribed text has the higher overall score is labeled the clinician, and the other is designated the participant.
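A minimal sketch of that scoring idea, assuming the sentence-transformers package; the model name and PANSS-style prompts below are illustrative assumptions, not the prompts or model OpenWillis actually ships with.

import numpy as np
from sentence_transformers import SentenceTransformer, util

# Illustrative model and prompt set; both are assumptions for the sketch.
model = SentenceTransformer('all-MiniLM-L6-v2')
expected_prompts = [
    'How have you been sleeping lately?',
    'Do you ever hear voices other people cannot hear?',
    'How has your mood been over the past week?',
    'Do you feel that people are trying to harm you?',
    'Have you had trouble concentrating?',
]

def rater_score(transcript_text):
    """Average the similarity scores of the five best-matching prompts."""
    prompt_emb = model.encode(expected_prompts, convert_to_tensor=True)
    text_emb = model.encode(transcript_text, convert_to_tensor=True)
    sims = util.cos_sim(text_emb, prompt_emb).cpu().numpy().ravel()
    return float(np.sort(sims)[-5:].mean())

# The speaker whose transcript scores higher would be labeled the clinician.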
filepath
Type | str |
Description | URI path to the audio file on an S3 bucket; all audio file formats supported by AWS are accepted |
language
Type | str; optional, default is 'en-US', i.e. US English |
Description | Amazon Transcribe language code; see https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html |
region
Type | str; optional, default is 'us-east-1' |
Description | AWS region code |
job_name
Type | str; optional, default is 'transcribe_job_01' |
Description | Name of the transcription job, required by AWS |
ShowSpeakerLabels
Type | Boolean; optional, default is True |
Description | Identifies and labels different speakers in audio or video transcriptions |
MaxSpeakerLabels
Type | int; optional, default is 2 |
Description | Maximum number of speakers to be identified and labeled in audio or video transcriptions |
c_scale
Type | str; optional, default is '' |
Description | In case the user wants to identify the speakers (clinician vs. participant) in the transcribed json, they can enter the name of the clinical scale that the audio file is capturing. The function currently supports only 'panss'. |
access_key
Type | str; optional, default is '' |
Description | A unique credential used to authenticate and authorize access to AWS (Amazon Web Services) resources and services. Users running from within their AWS account (e.g., from an EC2 instance) do not need to enter this; enter it only when running from a local machine. |
secret_key
Type | str; optional, default is '' |
Description | A unique credential used to authenticate and authorize access to AWS (Amazon Web Services) resources and services. Users running from within their AWS account (e.g., from an EC2 instance) do not need to enter this; enter it only when running from a local machine. |
json
Type | json |
Description | json output listing each word transcribed, the confidence associated with that word's transcription, its utterance start and end times, and the speaker label for those time points |
transcript
Type | str |
Description | String containing the entire transcription, without punctuation |
Below are dependencies specific to calculation of this measure.
Dependency | License | Justification |
SentenceTransformer | Apache 2.0 | A pre-trained BERT-based sentence transformer model used to compute semantic similarity between transcribed speech and expected clinical-scale prompts. |
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.