Speech characteristics v2.1
Date completed | October 5, 2023 |
Release where first appeared | v1.5 |
Researcher / Developer | Georgios Efstathiadis, Vijay Yadav |
import openwillis as ow
words, phrases, turns, summary = ow.speech_characteristics(json_conf = '', language = '', speaker_label = '')
This function measures language characteristics from a transcript of an individual’s speech. The transcript inputted is the JSON output from the Speech Transcription function.
This version of the function only supports JSON files acquired through the Vosk or WhisperX models from the Speech Transcription function.
By default, the function assumes the transcript contains speech from one speaker. Consequently, all variables below are calculated from the entire transcript and the per-turn measures are not populated.
In the case of multiple speakers labeled in the JSON transcript, the user must define which speaker they want to quantify speech characteristics from by using the speaker_label argument. Only when a speaker is specified are per-turn measures calculated.
If a language other than English (language = 'en') is specified, the subset of the variables below that depend on pre-trained models limited to the English language will not be calculated.
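As a rough usage sketch of the arguments described above: the helper names, the file name, and the speaker label below are hypothetical placeholders, and the sketch assumes that json_conf accepts the parsed JSON object saved from the Speech Transcription function.

```python
import json


def load_transcript(path):
    """Read a transcript JSON saved from the Speech Transcription function."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def measure_speaker(transcript, speaker_label=None, language="en"):
    """Call speech_characteristics for one speaker (hypothetical wrapper)."""
    # Imported lazily so load_transcript works without OpenWillis installed.
    import openwillis as ow

    return ow.speech_characteristics(
        json_conf=transcript, language=language, speaker_label=speaker_label
    )


# Hypothetical usage; "interview.json" and "speaker0" are placeholders:
# words, phrases, turns, summary = measure_speaker(
#     load_transcript("interview.json"), speaker_label="speaker0"
# )
```

Leaving speaker_label as None corresponds to the single-speaker default, in which case the per-turn measures are not populated.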
The function’s first output is a words dataframe, which contains characteristics specific to each word spoken by the speaker. This includes:
- Pause time before the word in seconds (pre_word_pause)
  Individual word timestamps in the input JSON are used to calculate pause lengths before each word. To avoid measuring potentially long silences before the start of speech in an audio file, the pre_word_pause for the first word in every file is set to NaN. To distinguish pre_word_pause from pre_phrase_pause and pre_turn_pause, as defined later in this document, pre_word_pause for the first word in each phrase and turn is also set to NaN.
- Number of syllables, identified using NLTK’s SyllableTokenizer (num_syllables)
- Part of speech associated with the word, identified using NLTK, as specified in the part_of_speech column:
  - noun
  - verb
  - adjective
  - pronoun
- Emotional valence associated with each word, calculated using the vaderSentiment library:
  - Degree of positive valence ranging from 0-1 (sentiment_pos)
  - Degree of negative valence ranging from 0-1 (sentiment_neg)
  - Degree of neutral valence ranging from 0-1 (sentiment_neu)
  - Degree of overall valence ranging from 0-1 (sentiment_overall)
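The pre_word_pause rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the 'start' and 'end' keys (in seconds) and the phrase_starts argument are assumptions about how the word timestamps and phrase or turn boundaries might be represented.

```python
import math


def pre_word_pauses(words, phrase_starts=()):
    """Return the pause in seconds before each word.

    words: list of dicts with assumed 'start' and 'end' keys (seconds).
    phrase_starts: indices of words that open a phrase or turn (assumed input).
    """
    # The first word of the file is always treated as a boundary.
    boundaries = {0, *phrase_starts}
    pauses = []
    for i, word in enumerate(words):
        if i in boundaries:
            # Boundary words get NaN rather than a measured pause.
            pauses.append(math.nan)
        else:
            pauses.append(word["start"] - words[i - 1]["end"])
    return pauses


words = [
    {"word": "hello", "start": 1.0, "end": 1.5},
    {"word": "there", "start": 1.75, "end": 2.0},
    {"word": "friend", "start": 3.0, "end": 3.5},
]
pauses = pre_word_pauses(words)  # [nan, 0.25, 1.0]
```

Passing phrase_starts=[2] would also set the pause before "friend" to NaN, mirroring the rule for the first word of each phrase and turn.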
If the input JSON file also identifies phrases (which is the case if the user employed the speech_transcription_cloud function and used Amazon Transcribe to conduct the transcription), the function will populate phrase-level characteristics in a phrases dataframe. This includes:
- Pause time before the phrase in seconds (pre_phrase_pause)
  The pre_phrase_pause for phrases at the beginning of a file and at the beginning of a turn is set to NaN, to remove the effect of long silences at the beginning of audio files and to distinguish this variable from pre_turn_pause, respectively.
- Total length of the phrase in minutes (phrase_length_minutes)
- Total number of words in the phrase (phrase_length_words)
- Rate of speech in words per minute (words_per_min)
- Articulation rate, i.e., syllables per minute (syllables_per_min)
- Pause rate in pauses per minute (pauses_per_min)
- Pause variability in seconds (pause_variability)
- Mean pause time between words in seconds (mean_pause_length)
- Speech percentage, i.e., time spoken over total time (speech_percentage)
- Parts of speech used in the entire phrase, with a column each for:
  - noun_percentage
  - verb_percentage
  - adjective_percentage
  - pronoun_percentage
- Emotional valence associated with the phrase:
  - Degree of positive valence ranging from 0-1 (sentiment_pos)
  - Degree of negative valence ranging from 0-1 (sentiment_neg)
  - Degree of neutral valence ranging from 0-1 (sentiment_neu)
  - Degree of overall valence ranging from 0-1 (sentiment_overall)
- Lexical diversity as measured by the moving average type token ratio (MATTR) score, calculated using the LexicalRichness library (mattr)
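To make the MATTR score concrete, here is a small pure-Python sketch of a moving average type token ratio. It is not the LexicalRichness implementation: the window size is a hypothetical choice, and the fallback to a plain type token ratio for short inputs is an assumption made for this illustration.

```python
def mattr(tokens, window_size=5):
    """Moving average type token ratio over a sliding window of tokens."""
    if not tokens:
        return float("nan")
    if len(tokens) < window_size:
        # Assumed fallback: plain type token ratio when the text is
        # shorter than one window.
        return len(set(tokens)) / len(tokens)
    # Type token ratio of every full window, then the mean of those ratios.
    ratios = [
        len(set(tokens[i:i + window_size])) / window_size
        for i in range(len(tokens) - window_size + 1)
    ]
    return sum(ratios) / len(ratios)


score = mattr("yes yes no no no".split(), window_size=2)  # 0.625
```

Because each window has the same length, the score is less sensitive to overall text length than a single type token ratio over the whole phrase.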
A turn is a single speech segment by a speaker.
For a single-speaker file, a turn covers the entire speech and hence this output will be populated with NaNs and can be disregarded.
When there are multiple speakers marked in the JSON, a turn consists of everything a speaker says before the next one starts. Therefore, a one-speaker JSON has a single turn, while a multi-speaker JSON has multiple, alternating turns. Accordingly, this function provides a turns dataframe, detailing turn-level speech and language characteristics. This includes:
- Pause time before the turn in seconds (pre_turn_pause)
  This is set to NaN for turns at the beginning of the audio file to remove the effect of potentially long silences at the beginning of an audio file.
- Total length of the turn in minutes (turn_length_minutes)
- Total number of words in the turn (turn_length_words)
- Rate of speech in words per minute (words_per_min)
- Articulation rate, i.e., syllables per minute (syllables_per_min)
- Pause rate in pauses per minute (pauses_per_min)
- Pause variability in seconds (pause_variability)
- Mean pause time between words in seconds (mean_pause_length)
- Speech percentage, i.e., time spoken over total time (speech_percentage)
- Parts of speech used in the entire turn, with a column each for:
  - noun_percentage
  - verb_percentage
  - adjective_percentage
  - pronoun_percentage
- Emotional valence associated with the turn:
  - Degree of positive valence ranging from 0-1 (sentiment_pos)
  - Degree of negative valence ranging from 0-1 (sentiment_neg)
  - Degree of neutral valence ranging from 0-1 (sentiment_neu)
  - Degree of overall valence ranging from 0-1 (sentiment_overall)
- Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)
- Flag for negative pre-turn pause, i.e., interrupt (interrupt_flag)
  Set to True if the pre-turn pause is negative; otherwise it is set to False.
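The relationship between pre_turn_pause and interrupt_flag can be sketched as below. This is an illustration under stated assumptions, not the library's code: the turn dictionaries with 'speaker', 'start', and 'end' keys and the speaker names are hypothetical, and consecutive entries are assumed to be alternating turns ordered by start time.

```python
import math


def pre_turn_pauses(turns, speaker):
    """Compute pre_turn_pause and interrupt_flag for one speaker's turns.

    turns: all turns in order, as dicts with assumed 'speaker', 'start',
    and 'end' keys (seconds).
    """
    rows = []
    for i, turn in enumerate(turns):
        if turn["speaker"] != speaker:
            continue
        if i == 0:
            # A turn at the beginning of the file has no measurable pause.
            pause = math.nan
        else:
            pause = turn["start"] - turns[i - 1]["end"]
        # A negative pause means the speaker started before the previous
        # turn ended; a NaN pause compares as False.
        rows.append({"pre_turn_pause": pause, "interrupt_flag": pause < 0})
    return rows


turns = [
    {"speaker": "clinician", "start": 0.5, "end": 4.0},
    {"speaker": "participant", "start": 4.5, "end": 9.0},
    {"speaker": "clinician", "start": 9.5, "end": 12.0},
    {"speaker": "participant", "start": 11.5, "end": 15.0},
]
rows = pre_turn_pauses(turns, "participant")
```

Here the participant's second turn starts half a second before the clinician finishes, so its pre_turn_pause is -0.5 and its interrupt_flag is True.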
Finally, the function outputs a summary dataframe, which compiles file-level information.
- Total length of speech in minutes (speech_length_minutes)
- Total number of words spoken (speech_length_words)
- Rate of speech in words per minute (words_per_min)
- Articulation rate, i.e., syllables per minute (syllables_per_min)
- Pause rate in pauses per minute (pauses_per_min)
- Pre-word pause mean in seconds (word_pause_length_mean)
- Pre-word pause variability in seconds (word_pause_variability)
- Pre-phrase pause mean in seconds (phrase_pause_length_mean)
- Pre-phrase pause variability in seconds (phrase_pause_variability)
- Speech percentage, i.e., time spoken over total time (speech_percentage)
- Parts of speech used in the entire file, with a column each for:
  - noun_percentage
  - verb_percentage
  - adjective_percentage
  - pronoun_percentage
- Emotional valence associated with the speech:
  - Degree of positive valence ranging from 0-1 (sentiment_pos)
  - Degree of negative valence ranging from 0-1 (sentiment_neg)
  - Degree of neutral valence ranging from 0-1 (sentiment_neu)
  - Degree of overall valence ranging from 0-1 (sentiment_overall)
- Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)
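A few of the timing measures in this list can be illustrated from word timestamps alone. The sketch below rests on several assumptions that the text does not confirm: words carry 'start' and 'end' keys in seconds, total time runs from the first word's start to the last word's end, every gap between consecutive words counts as a pause, variability is taken as the sample standard deviation, and speech_percentage is expressed on a 0-100 scale.

```python
import statistics


def summarize_words(words):
    """Sketch of selected file-level timing measures from word timestamps."""
    total_seconds = words[-1]["end"] - words[0]["start"]
    spoken_seconds = sum(w["end"] - w["start"] for w in words)
    # Gap between the end of each word and the start of the next one.
    pauses = [nxt["start"] - cur["end"] for cur, nxt in zip(words, words[1:])]
    return {
        "speech_length_minutes": total_seconds / 60,
        "speech_length_words": len(words),
        "words_per_min": len(words) * 60 / total_seconds,
        "word_pause_length_mean": statistics.mean(pauses),
        # Assumption: variability as the sample standard deviation.
        "word_pause_variability": statistics.stdev(pauses),
        # Assumption: percentage on a 0-100 scale.
        "speech_percentage": 100 * spoken_seconds / total_seconds,
    }


words = [
    {"word": "one", "start": 0.0, "end": 0.5},
    {"word": "two", "start": 1.0, "end": 1.5},
    {"word": "three", "start": 2.5, "end": 3.0},
]
summary = summarize_words(words)
```

With these three words spread over three seconds, the sketch gives 60 words per minute, a mean pause of 0.75 seconds, and a speech percentage of 50. At least three words are needed, since the sample standard deviation requires two or more pauses.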
In case the source JSON contains more than one speaker, we assume each speaker will have more than one turn. Hence, the following turn summary statistics will also be included (otherwise set to NaN):
- Total number of turns (num_turns)
- Mean length of turns in minutes (mean_turn_length_minutes)
- Mean length of turns in words spoken (mean_turn_length_words)
- Mean pause time before each turn (mean_pre_turn_pause)
- Number of one-word turns (num_one_word_turns)
- Number of interrupts, i.e., negative pre-turn pauses (num_interrupts)
  The sum of interrupt flags from the turns dataframe.
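The turn summary statistics above are aggregations over the per-turn rows, which can be sketched as follows. The row dictionaries reuse the turn-level column names from this page; skipping NaN pauses when averaging is an assumption of this sketch rather than confirmed behavior.

```python
import math
from statistics import mean


def summarize_turns(turns):
    """Aggregate per-turn rows (dicts) into turn summary statistics."""
    # Assumption: NaN pre-turn pauses are left out of the mean.
    pauses = [
        t["pre_turn_pause"] for t in turns
        if not math.isnan(t["pre_turn_pause"])
    ]
    return {
        "num_turns": len(turns),
        "mean_turn_length_minutes": mean(t["turn_length_minutes"] for t in turns),
        "mean_turn_length_words": mean(t["turn_length_words"] for t in turns),
        "mean_pre_turn_pause": mean(pauses) if pauses else math.nan,
        "num_one_word_turns": sum(t["turn_length_words"] == 1 for t in turns),
        # Sum of the interrupt flags, as described above.
        "num_interrupts": sum(t["interrupt_flag"] for t in turns),
    }


turns = [
    {"turn_length_minutes": 0.5, "turn_length_words": 60,
     "pre_turn_pause": math.nan, "interrupt_flag": False},
    {"turn_length_minutes": 0.25, "turn_length_words": 1,
     "pre_turn_pause": 0.5, "interrupt_flag": False},
    {"turn_length_minutes": 0.75, "turn_length_words": 89,
     "pre_turn_pause": -0.5, "interrupt_flag": True},
]
turn_summary = summarize_turns(turns)
```

For these three example turns, the sketch reports one one-word turn and one interrupt.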
json_conf
Type | json |
Description | output from speech transcription function |
language
Type | str, optional, default = ‘en’ |
Description | the language for which speech characteristics will be calculated. If the language is English, all shown variables will be calculated. If the language is not English, only language-independent variables will be calculated (e.g., pause characteristics). |
speaker_label
Type | str, optional, default = None |
Description | the speaker label from the JSON file for which the speech characteristics are calculated |
Type | pandas.DataFrame |
Description | summary of speech characteristics from inputted json file (word-level) |
What a transpose of the data frame looks like:
pre_word_pause | ← duration of pre-word pause in seconds |
no_syllables | ← number of syllables |
part_of_speech | ← whether the word is noun, verb, adjective or pronoun |
sentiment_neg | ← negative sentiment of speech |
sentiment_neu | ← neutral sentiment of speech |
sentiment_pos | ← positive sentiment of speech |
sentiment_overall | ← overall sentiment of speech |
Type | pandas.DataFrame |
Description | summary of speech characteristics from inputted json file (phrase-level) |
What a transpose of the data frame looks like:
pre_phrase_pause | ← duration of pre-phrase pause in seconds |
phrase_length_minutes | ← total time of phrase in minutes |
phrase_length_words | ← total words spoken |
words_per_min | ← rate of speech in words per minute |
syllables_per_min | ← articulation rate, i.e., syllables per minute |
pauses_per_min | ← pause rate, i.e., number of pauses per minute |
pause_variability | ← pause variability in seconds |
mean_pause_length | ← mean pause time between words in seconds |
speech_percentage | ← speech percentage, i.e., time spoken over total time |
noun_percentage | ← noun percentage |
verb_percentage | ← verb percentage |
adj_percentage | ← adjective percentage |
pronoun_percentage | ← pronoun percentage |
sentiment_neg | ← negative sentiment of speech |
sentiment_neu | ← neutral sentiment of speech |
sentiment_pos | ← positive sentiment of speech |
sentiment_overall | ← overall sentiment of speech |
mattr | ← moving average type token ratio |
Type | pandas.DataFrame or None |
Description | summary of speech characteristics from inputted json file (turn-level), only populated in case of multiple speakers |
What a transpose of the data frame looks like:
pre_turn_pause | ← duration of pre-turn pause in seconds |
turn_length_minutes | ← total time of turn in minutes |
turn_length_words | ← total words spoken |
words_per_min | ← rate of speech in words per minute |
syllables_per_min | ← articulation rate, i.e., syllables per minute |
pauses_per_min | ← pause rate, i.e., number of pauses per minute |
pause_variability | ← pause variability in seconds |
mean_pause_length | ← mean pause time between words in seconds |
speech_percentage | ← speech percentage, i.e., time spoken over total time |
noun_percentage | ← noun percentage |
verb_percentage | ← verb percentage |
adj_percentage | ← adjective percentage |
pronoun_percentage | ← pronoun percentage |
sentiment_pos | ← positive sentiment of speech |
sentiment_neg | ← negative sentiment of speech |
sentiment_neu | ← neutral sentiment of speech |
sentiment_overall | ← overall sentiment of speech |
mattr | ← moving average type token ratio |
interrupt_flag | ← flag for negative pre-turn pause, i.e., interrupt |
Type | pandas.DataFrame |
Description | summary of speech characteristics from inputted json file (file-level) |
What a transpose of the data frame looks like:
speech_length_minutes | ← total length of speech in minutes |
speech_length_words | ← total words spoken |
words_per_min | ← rate of speech in words per minute |
syllables_per_min | ← articulation rate, i.e., syllables per minute |
pauses_per_min | ← pauses per minute |
word_pause_length_mean | ← mean duration of pre-word pauses in seconds |
word_pause_variability | ← variability of pre-word pauses in seconds |
phrase_pause_length_mean | ← mean duration of pre-phrase pauses in seconds |
phrase_pause_variability | ← variability of pre-phrase pauses in seconds |
speech_percentage | ← speech percentage, i.e., time spoken over total time |
noun_percentage | ← noun percentage |
verb_percentage | ← verb percentage |
adj_percentage | ← adjective percentage |
pronoun_percentage | ← pronoun percentage |
sentiment_pos | ← positive sentiment of speech |
sentiment_neg | ← negative sentiment of speech |
sentiment_neu | ← neutral sentiment of speech |
sentiment_overall | ← overall sentiment of speech |
mattr | ← moving average type token ratio |
num_turns | ← total number of turns |
mean_turn_length_minutes | ← mean length of turns in minutes |
mean_turn_length_words | ← mean length of turns in words spoken |
turn_pause_length_mean | ← mean pause time before each turn |
num_one_word_turns | ← number of one-word turns |
num_interrupts | ← number of interrupts, i.e., negative pre-turn pauses |
Below are dependencies specific to calculation of this measure.
Dependency | License | Justification |
NLTK | Apache 2.0 | Well-established library for commonly measured natural language characteristics |
LexicalRichness | MIT | Straightforward implementation of methods for calculation of MATTR score |
vaderSentiment | MIT | Widely used library for sentiment analysis, trained on a large and heterogeneous dataset |
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.