Speaker separation cloud v1.0

Date completed: June 21, 2023
Release where first appeared: OpenWillis v1.3
Researcher / Developer: Vijay Yadav

1 – Use

1.1 – Processing

The JSON input comes from the cloud-based speech_transcription_cloud function:

import openwillis as ow

# json_response is the JSON transcript returned by speech_transcription_cloud;
# the '{}' below is only a placeholder for that output
signal_label = ow.speaker_separation_cloud(filepath = 'data.wav', json_response = '{}')

1.2 – Saving audio files

To use the signal_label output to save the separated audio files:

import openwillis as ow

# signal_label: dictionary returned by speaker_separation_cloud
# out_dir: directory where the separated WAV files will be written
ow.to_audio('data.wav', signal_label, out_dir)

2 – Methods

The speaker_separation_cloud function separates an audio file with two speakers into two audio signals, each containing speech from only one of the two speakers. The to_audio function is then used to save those signals as audio files.

2.1 – Following use of the speech_transcription_cloud function

The assumption is that the user has already used the speech_transcription_cloud function to acquire a JSON transcript with labeled speakers. This function uses the timepoints in the JSON to slice the audio file. It returns an audio signal dictionary whose keys are the speaker labels (either 'speaker0' and 'speaker1', or 'clinician' and 'participant') and whose values hold the corresponding audio signals in numpy array format.
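As a rough illustration of this slicing step (a minimal sketch, not the OpenWillis implementation; the segment layout and the helper name slice_by_speaker are assumptions):

import numpy as np
from scipy.io import wavfile

def slice_by_speaker(filepath, segments):
    # segments: list of dicts like {'speaker': 'speaker0', 'start': 0.0, 'end': 1.2},
    # with start/end times in seconds taken from the transcription JSON
    rate, signal = wavfile.read(filepath)
    speakers = {}
    for seg in segments:
        start, end = int(seg['start'] * rate), int(seg['end'] * rate)
        label = seg['speaker']
        # append this segment's samples to the speaker's running signal
        prev = speakers.get(label, np.empty(0, dtype=signal.dtype))
        speakers[label] = np.concatenate([prev, signal[start:end]])
    return speakers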

2.2 – Following use of the to_audio function

The to_audio function exports audio signals from a dictionary to individual WAV files. It takes a file path, a dictionary mapping speaker labels to their respective audio signals as numpy arrays, and an output directory. One WAV file per speaker is saved in the specified output directory with a unique name in the format "filename_speakerlabel.wav", where "filename" is the original file name and "speakerlabel" is the label of the speaker.
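A minimal sketch of this export step, assuming scipy for WAV writing (illustrative behavior only, not the OpenWillis source):

import os
from scipy.io import wavfile

def export_speakers(filepath, signal_label, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    rate, _ = wavfile.read(filepath)  # reuse the original file's sample rate
    base = os.path.splitext(os.path.basename(filepath))[0]
    for label, signal in signal_label.items():
        # e.g. data.wav + 'speaker0' -> out_dir/data_speaker0.wav
        wavfile.write(os.path.join(out_dir, f"{base}_{label}.wav"), rate, signal)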


3 – Inputs

3.1 – filepath

Type str
Description path to audio file to be separated

3.2 – json_response

Type json
Description JSON output listing each word transcribed, the confidence associated with that word's transcription, and its utterance start and end times.
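For illustration only, a transcript entry might look like the fragment below; the exact field names and nesting depend on the cloud transcription service and will differ in practice:

{
  "words": [
    {"word": "hello", "confidence": 0.98, "start": 0.00, "end": 0.42, "speaker": "speaker0"},
    {"word": "there", "confidence": 0.95, "start": 0.42, "end": 0.80, "speaker": "speaker1"}
  ]
}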

4 – Outputs

4.1 – signal_label

Type dictionary
Description A dictionary with the speaker label as the key and the audio signal numpy array as the value.

What the dictionary looks like:

labels = {'speaker0': [2, 35, 56, -52, …, 13, -14], 'speaker1': [12, 45, 26, -12, …, 43, -54]}


5 – Dependencies

Below are the dependencies specific to the calculation of this measure.

Dependency | License | Justification