Speech characteristics v2.1

Date completed	October 5, 2023
Release where first appeared	v1.5
Researcher / Developer	Georgios Efstathiadis, Vijay Yadav

1 – Use

import openwillis as ow

words, phrases, turns, summary = ow.speech_characteristics(json_conf = '', language = '', speaker_label = '')
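
As a slightly fuller illustration of the call above, here is a minimal sketch that assumes the transcript was saved to disk as a JSON file and can be passed to the function once loaded. The helper name `run_speech_characteristics`, the file path, and the speaker label value are hypothetical; only the function name and its three arguments come from this page.

```python
import json


def run_speech_characteristics(transcript_path, speaker_label=None):
    """Load a saved transcript and compute speech characteristics for it."""
    # Imported lazily so this helper can be defined without OpenWillis installed.
    import openwillis as ow

    # Assumption: the Speech Transcription output was written to a JSON file.
    with open(transcript_path) as f:
        transcript = json.load(f)

    # Returns the four outputs described on this page:
    # words, phrases, turns, summary.
    return ow.speech_characteristics(
        json_conf=transcript,
        language="en",
        speaker_label=speaker_label,
    )
```

A call might then look like `words, phrases, turns, summary = run_speech_characteristics("transcript.json", speaker_label="speaker0")`, where both argument values are placeholders.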

2 – Methods

This function measures language characteristics from a transcript of an individual’s speech. The input transcript is the JSON output of the Speech Transcription function.

This version of the function only supports JSON files acquired through the Vosk or WhisperX models from the Speech Transcription function.

By default, the function assumes the transcript contains speech from one speaker. Consequently, all variables below are calculated from the entire transcript and the per-turn measures are not populated.

In the case of multiple speakers labeled in the JSON transcript, the user must define which speaker they want to quantify speech characteristics from by using the speaker_label argument. Only when a speaker is specified are per-turn measures calculated.

If a language other than English (language = 'en') is specified, a subset of the variables below dependent on pre-trained models that are limited to the English language will not be calculated.

2.1 – Per-word measures

The function’s first output is a words dataframe, which contains characteristics specific to each word spoken by the speaker. This includes:

Pause time before the word in seconds (pre_word_pause)

Individual word timestamps in the input JSON are used to calculate pause lengths before each word. To avoid measurement of potential long silences prior to the start of speech in an audio file, the pre_word_pause for the first word in every file is set to NaN. To distinguish pre_word_pause from pre_phrase_pause and pre_turn_pause as defined later in this document, pre_word_pause for the first word in each phrase and turn is also set to NaN.
Number of syllables, identified using NLTK’s SyllableTokenizer (num_syllables)
Part of speech associated with the word, identified using NLTK, as specified in the part_of_speech column:
- noun
- verb
- adjective
- pronoun
Emotional valence associated with each word, calculated using the vaderSentiment library:
- Degree of positive valence ranging from 0-1 (sentiment_pos)
- Degree of negative valence ranging from 0-1 (sentiment_neg)
- Degree of neutral valence ranging from 0-1 (sentiment_neu)
- Degree of overall valence ranging from -1 to 1 (sentiment_overall)
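
The pause rule described above can be sketched as follows. This is an illustration only, not the OpenWillis implementation: the dictionary keys `start`, `end`, and `phrase_start` are hypothetical stand-ins for whatever the transcript JSON provides.

```python
import math


def pre_word_pauses(words):
    """Return the pause before each word, in seconds.

    `words` is a list of dicts with hypothetical keys 'start' and 'end'
    (seconds) and an optional 'phrase_start' flag marking the first word
    of a phrase or turn.
    """
    pauses = []
    for i, word in enumerate(words):
        if i == 0 or word.get("phrase_start", False):
            # First word of the file, phrase, or turn: left as NaN so it is
            # not confused with pre_phrase_pause or pre_turn_pause.
            pauses.append(math.nan)
        else:
            # Gap between the previous word's end and this word's start.
            pauses.append(word["start"] - words[i - 1]["end"])
    return pauses


words = [
    {"word": "hello", "start": 1.5, "end": 2.0, "phrase_start": True},
    {"word": "there", "start": 2.25, "end": 2.75},
    {"word": "friend", "start": 3.75, "end": 4.25},
]
pauses = pre_word_pauses(words)
```

Here the first word gets NaN even though 1.5 seconds of silence precede it, which is the behavior the text describes for the start of a file.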

2.2 – Per-phrase measures

If the input JSON file also identifies phrases (as is the case when the user ran the speech_transcription_cloud function with Amazon Transcribe), the function populates phrase-level characteristics in a phrases dataframe. This includes:

Pause time before the phrase in seconds (pre_phrase_pause)

The pre_phrase_pause for phrases at the beginning of a file and at the beginning of a turn is set to NaN, to remove the effect of long silences at the beginning of audio files and to distinguish this variable from pre_turn_pause, respectively.
Total length of the phrase in minutes (phrase_length_minutes)
Total number of words in the phrase (phrase_length_words)
Rate of speech in words per minute (words_per_min)
Articulation rate, i.e., syllables per minute (syllables_per_min)
Pause rate in pauses per minute (pauses_per_min)
Pause variability in seconds (pause_variability)
Mean pause time between words in seconds (mean_pause_length)
Speech percentage i.e., time spoken over total time (speech_percentage)
Parts of speech used in the entire phrase, with a column each for:
- noun_percentage
- verb_percentage
- adjective_percentage
- pronoun_percentage
Emotional valence associated with the phrase:
- Degree of positive valence ranging from 0-1 (sentiment_pos)
- Degree of negative valence ranging from 0-1 (sentiment_neg)
- Degree of neutral valence ranging from 0-1 (sentiment_neu)
- Degree of overall valence ranging from -1 to 1 (sentiment_overall)
Lexical diversity as measured by the moving average type token ratio (MATTR) score, calculated using the LexicalRichness library (mattr)
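
To make the MATTR measure concrete, here is a minimal pure-Python sketch of a moving average type token ratio. The function itself relies on the LexicalRichness library; the code below only illustrates the general idea, and the window size and the handling of texts shorter than the window are assumptions made for this example.

```python
def mattr(tokens, window_size=50):
    """Moving average type token ratio over sliding windows of tokens."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        return float("nan")
    if len(tokens) <= window_size:
        # Assumption: fall back to a plain type token ratio for short texts.
        return len(set(tokens)) / len(tokens)
    # Type token ratio (unique words / total words) of every full window.
    ratios = [
        len(set(tokens[i:i + window_size])) / window_size
        for i in range(len(tokens) - window_size + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = "the cat and the dog and the bird".split()
score = mattr(tokens, window_size=4)
```

With a window of four words, this example has five windows with ratios 0.75, 1.0, 0.75, 0.75, and 1.0, so the score is their mean, 0.85.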

2.3 – Per-turn measures

A turn is a single speech segment by a speaker.

For a single-speaker file, a turn covers the entire speech and hence this output will be populated with NaNs and can be disregarded.

When there are multiple speakers marked in the JSON, a turn consists of everything a speaker says before the next one starts. Therefore, a one-speaker JSON has a single turn, while a multi-speaker JSON has multiple, alternating turns. Accordingly, this function provides a turns dataframe, detailing turn-level speech and language characteristics. This includes:

Pause time before the turn in seconds (pre_turn_pause)

This is set to NaN for turns at the beginning of the audio file to remove the effect of potential long silences at the beginning of an audio file.
Total length of the turn in minutes (turn_length_minutes)
Total number of words in the turn (turn_length_words)
Rate of speech in words per minute (words_per_min)
Articulation rate, i.e., syllables per minute (syllables_per_min)
Pause rate in pauses per minute (pauses_per_min)
Pause variability in seconds (pause_variability)
Mean pause time between words in seconds (mean_pause_length)
Speech percentage i.e., time spoken over total time (speech_percentage)
Parts of speech used in the entire turn, with a column each for:
- noun_percentage
- verb_percentage
- adjective_percentage
- pronoun_percentage
Emotional valence associated with the turn:
- Degree of positive valence ranging from 0-1 (sentiment_pos)
- Degree of negative valence ranging from 0-1 (sentiment_neg)
- Degree of neutral valence ranging from 0-1 (sentiment_neu)
- Degree of overall valence ranging from -1 to 1 (sentiment_overall)
Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)
Flag for negative pre-turn pause, i.e. interrupt (interrupt_flag)

Set to True if the pre-turn pause is negative; otherwise set to False.
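
The turn logic described in this section can be sketched as below. This is an illustrative reading of the text, not the library's code: the word dictionaries, their keys, and the speaker labels `speaker0` and `speaker1` are hypothetical.

```python
import math


def build_turns(words):
    """Group consecutive words from the same speaker into turns."""
    turns = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = word["end"]
            turns[-1]["words"].append(word["word"])
        else:
            # Speaker changed: start a new turn.
            turns.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "words": [word["word"]],
            })
    return turns


def speaker_turn_measures(turns, speaker_label):
    """Compute pre_turn_pause and interrupt_flag for one speaker's turns."""
    rows = []
    for i, turn in enumerate(turns):
        if turn["speaker"] != speaker_label:
            continue
        # The first turn in the file has no preceding turn, so its pause is NaN.
        pause = math.nan if i == 0 else turn["start"] - turns[i - 1]["end"]
        rows.append({
            "pre_turn_pause": pause,
            "turn_length_words": len(turn["words"]),
            # NaN < 0 evaluates to False, so the first turn is never flagged.
            "interrupt_flag": pause < 0,
        })
    return rows


words = [
    {"word": "how", "speaker": "speaker0", "start": 0.0, "end": 0.5},
    {"word": "are", "speaker": "speaker0", "start": 0.5, "end": 1.0},
    {"word": "you", "speaker": "speaker0", "start": 1.0, "end": 1.5},
    {"word": "fine", "speaker": "speaker1", "start": 2.0, "end": 2.5},
    {"word": "good", "speaker": "speaker0", "start": 3.0, "end": 3.5},
    {"word": "to", "speaker": "speaker0", "start": 3.5, "end": 3.75},
    {"word": "hear", "speaker": "speaker0", "start": 3.75, "end": 4.5},
    {"word": "yes", "speaker": "speaker1", "start": 4.25, "end": 4.75},
]
rows = speaker_turn_measures(build_turns(words), "speaker1")
```

In this example the second `speaker1` turn starts 0.25 seconds before the previous turn ends, so its pre-turn pause is negative and it is flagged as an interrupt.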

2.4 – Summary

Finally, the function outputs a summary dataframe, which compiles file-level information.

Total length of speech in minutes (speech_length_minutes)
Total number of words spoken (speech_length_words)
Rate of speech in words per minute (words_per_min)
Articulation rate i.e. syllables per minute (syllables_per_min)
Pause rate in pauses per minute (pauses_per_min)
Pre-word pause mean in seconds (word_pause_length_mean)
Pre-word variability in seconds (word_pause_variability)
Pre-phrase pause mean in seconds (phrase_pause_length_mean)
Pre-phrase variability in seconds (phrase_pause_variability)
Speech percentage i.e., time spoken over total time (speech_percentage)
Parts of speech used in the entire file, with a column each for:
- noun_percentage
- verb_percentage
- adjective_percentage
- pronoun_percentage
Emotional valence associated with the speech:
- Degree of positive valence ranging from 0-1 (sentiment_pos)
- Degree of negative valence ranging from 0-1 (sentiment_neg)
- Degree of neutral valence ranging from 0-1 (sentiment_neu)
- Degree of overall valence ranging from -1 to 1 (sentiment_overall)
Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)
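
The rate measures listed above can be illustrated with a short sketch. The exact definitions used by the function are not spelled out on this page, so several choices here are assumptions: total time is taken from the first word's start to the last word's end, time spoken is the sum of word durations, and speech_percentage is scaled to 0-100. The `syllables` key is a hypothetical per-word syllable count.

```python
def rate_measures(words):
    """Compute simple file-level rate measures from word timestamps."""
    # Assumption: total time spans the first word's start to the last word's end.
    total_seconds = words[-1]["end"] - words[0]["start"]
    # Assumption: time spoken is the sum of individual word durations.
    spoken_seconds = sum(w["end"] - w["start"] for w in words)
    minutes = total_seconds / 60
    return {
        "speech_length_minutes": minutes,
        "speech_length_words": len(words),
        "words_per_min": len(words) / minutes,
        "syllables_per_min": sum(w["syllables"] for w in words) / minutes,
        # Assumption: expressed on a 0-100 scale.
        "speech_percentage": 100 * spoken_seconds / total_seconds,
    }


words = [
    {"start": 0.0, "end": 0.5, "syllables": 1},
    {"start": 0.75, "end": 1.25, "syllables": 2},
    {"start": 1.5, "end": 2.25, "syllables": 1},
    {"start": 2.5, "end": 3.0, "syllables": 2},
]
measures = rate_measures(words)
```

For these four words spread over 3 seconds, the sketch gives roughly 80 words per minute, 120 syllables per minute, and a speech percentage of 75.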

If the source JSON contains more than one speaker, each speaker is assumed to have more than one turn, so the following turn summary statistics are also included (otherwise they are set to NaN):

Total number of turns (num_turns)
Mean length of turns in minutes (mean_turn_length_minutes)
Mean length of turns in words spoken (mean_turn_length_words)
Mean pause time before each turn (mean_pre_turn_pause)
Number of one-word turns (num_one_word_turns)
Number of interrupts, i.e. negative pre-turn pauses (num_interrupts)

This is the sum of the interrupt flags in the turns dataframe.
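
As a sketch of how these turn summary statistics relate to the per-turn measures, the following aggregates a list of per-turn rows. The rows are plain dictionaries standing in for the turns dataframe, and the function name `summarize_turns` is hypothetical; the mean pre-turn pause is left out to keep the example short.

```python
def summarize_turns(turns):
    """Aggregate per-turn rows into file-level turn statistics.

    Assumes `turns` is a non-empty list of dicts keyed like the turns output.
    """
    n = len(turns)
    return {
        "num_turns": n,
        "mean_turn_length_minutes": sum(t["turn_length_minutes"] for t in turns) / n,
        "mean_turn_length_words": sum(t["turn_length_words"] for t in turns) / n,
        "num_one_word_turns": sum(1 for t in turns if t["turn_length_words"] == 1),
        # Number of interrupts is the count of turns with interrupt_flag set.
        "num_interrupts": sum(1 for t in turns if t["interrupt_flag"]),
    }


turns = [
    {"turn_length_minutes": 0.5, "turn_length_words": 60, "interrupt_flag": False},
    {"turn_length_minutes": 0.25, "turn_length_words": 1, "interrupt_flag": True},
    {"turn_length_minutes": 0.75, "turn_length_words": 89, "interrupt_flag": False},
]
turn_summary = summarize_turns(turns)
```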

3 – Inputs

3.1 – `transcript_json`

Type	json
Description	output from speech transcription function

3.2 – `language`

Type	str, optional, default = 'en'
Description	The language for which speech characteristics will be calculated. If the language is English, all variables shown will be calculated. If the language is not English, only language-independent variables (e.g., pause characteristics) will be calculated.

3.3 – `speaker_label`

Type	str, optional, default = None
Description	the speaker label from the JSON file for which the speech characteristics are calculated

4 – Outputs

4.1 – `words`

Type	pandas.DataFrame
Description	summary of speech characteristics from inputted json file (word-level)

What a transpose of the data frame looks like:

`pre_word_pause`	← duration of pre-word pause in seconds
`no_syllables`	← number of syllables
`part_of_speech`	← whether the word is noun, verb, adjective or pronoun
`sentiment_neg`	← negative sentiment of speech
`sentiment_neu`	← neutral sentiment of speech
`sentiment_pos`	← positive sentiment of speech
`sentiment_overall`	← overall sentiment of speech

4.2 – `phrases`

Type	pandas.DataFrame
Description	summary of speech characteristics from inputted json file (phrase-level)

What a transpose of the data frame looks like:

`pre_phrase_pause`	← duration of pre-phrase pause in seconds
`phrase_length_minutes`	← total time of phrase
`phrase_length_words`	← total words spoken
`words_per_min`	← rate of speech in words per minute
`syllables_per_min`	← articulation rate i.e. syllables per minute
`pauses_per_min`	← pause rate i.e. number of pauses per minute
`pause_variability`	← pause variability in seconds
`mean_pause_length`	← mean pause time between words in seconds
`speech_percentage`	← speech percentage i.e., time spoken over total time
`noun_percentage`	← noun percentage
`verb_percentage`	← verb percentage
`adj_percentage`	← adjective percentage
`pronoun_percentage`	← pronoun percentage
`sentiment_neg`	← negative sentiment of speech
`sentiment_neu`	← neutral sentiment of speech
`sentiment_pos`	← positive sentiment of speech
`sentiment_overall`	← overall sentiment of speech
`mattr`	← moving average type token ratio

4.3 – `turns`

Type	pandas.DataFrame or None
Description	summary of speech characteristics from inputted json file (turn-level), only populated in case of multiple speakers

What a transpose of the data frame looks like:

`pre_turn_pause`	← duration of pre-turn pause in seconds
`turn_length_minutes`	← total time of turn in minutes
`turn_length_words`	← total words spoken
`words_per_min`	← rate of speech in words per minute
`syllables_per_min`	← articulation rate i.e. syllables per minute
`pauses_per_min`	← pause rate i.e. number of pauses per minute
`pause_variability`	← pause variability in seconds
`mean_pause_length`	← mean pause time between words in seconds
`speech_percentage`	← speech percentage i.e., time spoken over total time
`noun_percentage`	← noun percentage
`verb_percentage`	← verb percentage
`adj_percentage`	← adjective percentage
`pronoun_percentage`	← pronoun percentage
`sentiment_pos`	← positive sentiment of speech
`sentiment_neg`	← negative sentiment of speech
`sentiment_neu`	← neutral sentiment of speech
`sentiment_overall`	← overall sentiment of speech
`mattr`	← moving average type token ratio
`interrupt_flag`	← flag for a negative pre-turn pause, i.e. an interrupt

4.4 – `summary`

Type	pandas.DataFrame
Description	summary of speech characteristics from inputted json file (file-level)

What a transpose of the data frame looks like:

`speech_length_minutes`	← total length of speech in minutes
`speech_length_words`	← total words spoken
`words_per_min`	← rate of speech in words per minute
`syllables_per_min`	← articulation rate i.e. syllables per minute
`pauses_per_min`	← pauses per minute
`word_pause_length_mean`	← mean duration of pre-word pauses in seconds
`word_pause_variability`	← variability of pre-word pauses in seconds
`phrase_pause_length_mean`	← mean duration of pre-phrase pauses in seconds
`phrase_pause_variability`	← variability of pre-phrase pauses in seconds
`speech_percentage`	← speech percentage i.e., time spoken over total time
`noun_percentage`	← noun percentage
`verb_percentage`	← verb percentage
`adj_percentage`	← adjective percentage
`pronoun_percentage`	← pronoun percentage
`sentiment_pos`	← positive sentiment of speech
`sentiment_neg`	← negative sentiment of speech
`sentiment_neu`	← neutral sentiment of speech
`sentiment_overall`	← overall sentiment of speech
`mattr`	← moving average type token ratio
`num_turns`	← total number of turns
`mean_turn_length_minutes`	← mean length of turns in minutes
`mean_turn_length_words`	← mean length of turns in words spoken
`turn_pause_length_mean`	← mean pause time before each turn
`num_one_word_turns`	← number of one-word turns
`num_interrupts`	← number of interrupts, i.e. negative pre-turn pauses

5 – Dependencies

Below are dependencies specific to calculation of this measure.

Dependency	License	Justification
NLTK	Apache 2.0	Well-established library for commonly measured natural language characteristics
LexicalRichness	MIT	Straightforward implementation of methods for calculation of MATTR score
vaderSentiment	MIT	Widely used library for sentiment analysis, trained on a large and heterogeneous dataset

OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.
