Speech transcription cloud v1.1
Date completed | September 26th, 2023 |
Release where first appeared | OpenWillis v1.4 |
Researcher / Developer | Vijay Yadav |
```python
import openwillis as ow

json, transcript = ow.speech_transcription_cloud(
    filepath = 's3://bucket/file.wav',
    model = 'aws',
    language = 'en-US',
    region = 'us-east-1',
    job_name = 'test',
    ShowSpeakerLabels = True,
    MaxSpeakerLabels = 2,
    access_key = 'xxxx',
    secret_key = 'xxxx',
    c_scale = 'panss'
)
```
This function uses Amazon Transcribe to transcribe speech into text, label each speaker, and, optionally for clinical interview recordings, identify each speaker as either the clinician or the participant.
Compared to the locally running Speech Transcription functions, the cloud-based version supports multiple speakers and is arguably more accurate. However, the user must have AWS configured on their system and, when running from a local machine, provide access and secret keys that allow the function to use their AWS account. To learn how to set up AWS, see the AWS documentation.
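Rather than hard-coding keys in a script, one option is to read them from environment variables and pass them along only when both are present. The sketch below assumes the conventional `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variable names; `build_aws_kwargs` is a hypothetical helper, not part of OpenWillis.

```python
import os


def build_aws_kwargs(env=None):
    """Collect the optional credential arguments from environment variables.

    Hypothetical helper (not part of OpenWillis). Returns an empty dict when
    either key is missing, so the credential arguments keep their defaults
    (for example when running from within an EC2 instance).
    """
    env = os.environ if env is None else env
    access_key = env.get("AWS_ACCESS_KEY_ID", "")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY", "")
    if access_key and secret_key:
        return {"access_key": access_key, "secret_key": secret_key}
    return {}
```

The result could then be unpacked into the call, for example `ow.speech_transcription_cloud(filepath = 's3://bucket/file.wav', **build_aws_kwargs())`.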
The function has two outputs. The `json` output contains start times, end times, transcription confidence scores, and (if `ShowSpeakerLabels` is true) speaker labels. The `json` also serves as an input to both the Speaker Separation Cloud and Speech Characteristics functions. The `transcript` output is simply a string of the entire transcription.
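To make the word-level content of the `json` output concrete, here is a minimal sketch that flattens it into a list of words with timings. It assumes the output follows Amazon Transcribe's standard layout (`results` → `items`, each item with `start_time`, `end_time`, `alternatives`, and `type`) and that the speaker label sits under a `speaker_label` key; the `sample` data and the `words_from_json` helper are illustrative, not part of OpenWillis.

```python
def words_from_json(transcribe_json):
    """Flatten word-level items into dicts with float timings and confidence."""
    words = []
    for item in transcribe_json.get("results", {}).get("items", []):
        # Punctuation items carry no timing information, so skip them.
        if item.get("type") != "pronunciation":
            continue
        best = item["alternatives"][0]
        words.append({
            "word": best["content"],
            "start": float(item["start_time"]),
            "end": float(item["end_time"]),
            "confidence": float(best["confidence"]),
            "speaker": item.get("speaker_label"),  # assumed key name
        })
    return words


# Hand-written illustrative data, not real transcription output.
sample = {
    "results": {
        "items": [
            {"start_time": "0.0", "end_time": "0.4", "type": "pronunciation",
             "speaker_label": "speaker0",
             "alternatives": [{"confidence": "0.99", "content": "hello"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": ","}]},
            {"start_time": "0.5", "end_time": "0.9", "type": "pronunciation",
             "speaker_label": "speaker1",
             "alternatives": [{"confidence": "0.95", "content": "there"}]},
        ]
    }
}

words = words_from_json(sample)
```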
In the case of two speakers, each word in the output `json` is by default labeled either `speaker0` or `speaker1`. This is done automatically by AWS. Using these labels, the user can process only the speech of a specific speaker with the Speech Characteristics function, which allows filtering the input `json` by speaker label before further processing.
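The idea of filtering by speaker label can be sketched as follows. The Speech Characteristics function handles this itself; the snippet only illustrates the concept, assuming the Amazon Transcribe layout (`results` → `items`) and a `speaker_label` key on each item. `filter_by_speaker` and the sample data are hypothetical.

```python
import copy


def filter_by_speaker(transcribe_json, speaker_label):
    """Return a copy of the json that keeps only one speaker's items."""
    filtered = copy.deepcopy(transcribe_json)  # leave the input untouched
    results = filtered.setdefault("results", {})
    results["items"] = [
        item for item in results.get("items", [])
        if item.get("speaker_label") == speaker_label  # assumed key name
    ]
    return filtered


# Hand-written illustrative data, not real transcription output.
two_speaker_json = {
    "results": {
        "items": [
            {"speaker_label": "speaker0",
             "alternatives": [{"content": "how"}]},
            {"speaker_label": "speaker1",
             "alternatives": [{"content": "fine"}]},
            {"speaker_label": "speaker0",
             "alternatives": [{"content": "good"}]},
        ]
    }
}

only_speaker0 = filter_by_speaker(two_speaker_json, "speaker0")
```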
If the audio file is a structured clinical interview, the user can enter the name of the clinical scale in the `c_scale` argument, and the function will use what it knows about the scale to label the speakers `clinician` and `participant` instead of `speaker0` and `speaker1`. The two scales currently supported are PANSS (`c_scale = 'panss'`) and MADRS (`c_scale = 'madrs'`).
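As a toy illustration of the general idea, the speaker whose words most resemble the scale's scripted prompts can be treated as the clinician. This is not how OpenWillis implements it: the sketch swaps in a simple token-overlap (Jaccard) score, and the prompts below are made-up placeholders rather than actual PANSS or MADRS items. All function names are hypothetical.

```python
def token_set(text):
    """Lowercase a text and split it into a set of punctuation-free tokens."""
    return {tok.strip(".,?!").lower() for tok in text.split()} - {""}


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def label_roles(speaker_text, scale_prompts):
    """Map each speaker label to 'clinician' or 'participant'.

    speaker_text: dict of speaker label -> all text spoken by that speaker.
    scale_prompts: list of scripted interviewer prompts (placeholders here).
    """
    prompt_tokens = token_set(" ".join(scale_prompts))
    scores = {
        spk: jaccard(token_set(text), prompt_tokens)
        for spk, text in speaker_text.items()
    }
    clinician = max(scores, key=scores.get)  # highest overlap with prompts
    return {
        spk: ("clinician" if spk == clinician else "participant")
        for spk in speaker_text
    }


# Made-up prompts and speech, for illustration only.
prompts = ["How have you been feeling over the past week?",
           "Have you been sleeping well?"]
speech = {
    "speaker0": "I have been okay I guess",
    "speaker1": "How have you been feeling this past week",
}
roles = label_roles(speech, prompts)
```

Here `speaker1` shares far more tokens with the prompts than `speaker0`, so it is labeled as the clinician.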
Input parameter | Type | Description |
`filepath` | str | URI path to the audio file in an S3 bucket; all file formats supported by AWS are accepted |
`language` | str; default is 'en-US' (US English) | Amazon Transcribe language code; see https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html |
`region` | str; default is 'us-east-1' | AWS region code |
`job_name` | str; default is 'transcribe_job_01' | Name of the transcription job, required by AWS |
`ShowSpeakerLabels` | Boolean; optional, default is True | Identifies and labels different speakers in audio or video transcriptions |
`MaxSpeakerLabels` | int; optional, default is 2 | Maximum number of speakers to identify and label in audio or video transcriptions |
`c_scale` | str; optional, default is '' | To identify the speakers (clinician vs. participant) in the transcribed `json`, enter the name of the clinical scale that the audio file captures. Supported values are 'panss' and 'madrs' |
`access_key` | str; optional, default is '' | AWS access key, a credential used to authenticate and authorize access to AWS (Amazon Web Services) resources and services. Not needed when running from within the user's AWS account (e.g., on an EC2 instance); only enter it when running from a local machine |
`secret_key` | str; optional, default is '' | AWS secret key, a credential used to authenticate and authorize access to AWS (Amazon Web Services) resources and services. Not needed when running from within the user's AWS account (e.g., on an EC2 instance); only enter it when running from a local machine |
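Amazon Transcribe requires job names to be unique within an AWS account, so reusing the default `job_name` across runs can fail. A minimal sketch for generating a fresh name on each run is shown below; it assumes the job-name constraints documented for Amazon Transcribe at the time of writing (characters `0-9a-zA-Z._-`, at most 200 characters). `make_job_name` is a hypothetical helper, not part of OpenWillis.

```python
import re
import uuid

# Assumed constraint on Amazon Transcribe job names.
JOB_NAME_PATTERN = re.compile(r"^[0-9a-zA-Z._-]{1,200}$")


def make_job_name(prefix="transcribe_job"):
    """Build a job name from a sanitized prefix plus a random suffix."""
    # Replace characters outside the allowed set, and cap the prefix length
    # so prefix + "_" + 32-character suffix stays within 200 characters.
    safe_prefix = re.sub(r"[^0-9a-zA-Z._-]", "_", prefix)[:160]
    return f"{safe_prefix}_{uuid.uuid4().hex}"


job_name = make_job_name("my interview #1")
```

The generated value can be passed as the `job_name` argument.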
Output | Type | Description |
`json` | json | Lists each transcribed word, the confidence level of that word's transcription, its start time and end time, and the speaker label for that time span |
`transcript` | str | String with the entire transcription and no punctuation |
Below are dependencies specific to calculation of this measure.
Dependency | License | Justification |
SentenceTransformer | Apache 2.0 | A pre-trained BERT-based sentence transformer model to compute the similarity between the speech. |
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.