
Speech characteristics v2.1

anzar edited this page Oct 5, 2023 · 2 revisions
Date completed October 5, 2023
Release where first appeared v1.5
Researcher / Developer Georgios Efstathiadis, Vijay Yadav

1 – Use

import openwillis as ow

words, phrases, turns, summary = ow.speech_characteristics(json_conf = '', language = '', speaker_label = '')

2 – Methods

This function measures language characteristics from a transcript of an individual’s speech. The input transcript is the JSON output of the Speech Transcription function.

This version of the function only supports JSON files acquired through the Vosk or WhisperX models from the Speech Transcription function.

By default, the function assumes the transcript contains speech from one speaker. Consequently, all variables below are calculated from the entire transcript and the per-turn measures are not populated.

If the JSON transcript labels multiple speakers, the user must use the speaker_label argument to select the speaker whose speech characteristics are quantified. Per-turn measures are calculated only when a speaker is specified.

If a language other than English is specified (i.e., language is not 'en'), the variables below that depend on English-only pre-trained models are not calculated.
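
As a rough illustration of the workflow described above, the sketch below loads a saved transcript and passes it to the function with the keyword arguments shown in section 1. The helper names, the file name, and the speaker label are hypothetical, and passing the parsed JSON object (rather than a file path) is an assumption.

```python
import json
from pathlib import Path


def load_transcript(path):
    """Read a transcript JSON file saved from the Speech Transcription function."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def measure_speaker(transcript, speaker_label=None, language="en"):
    """Call speech_characteristics with the keyword arguments from section 1.

    Assumption: json_conf receives the parsed JSON object.
    """
    import openwillis as ow  # imported lazily so load_transcript works without it

    return ow.speech_characteristics(
        json_conf=transcript,
        language=language,
        speaker_label=speaker_label,
    )


# Hypothetical usage; the file name and speaker label are placeholders:
# transcript = load_transcript("interview.json")
# words, phrases, turns, summary = measure_speaker(transcript, speaker_label="SPEAKER_00")
```

Leaving speaker_label unset corresponds to the single-speaker default, in which the per-turn output is not populated.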

2.1 – Per-word measures

The function’s first output is a words dataframe, which contains characteristics specific to each word spoken by the speaker. This includes:

  • Pause time before the word in seconds (pre_word_pause)

    Individual word timestamps in the input JSON are used to calculate pause lengths before each word. To avoid measurement of potential long silences prior to the start of speech in an audio file, the pre_word_pause for the first word in every file is set to NaN. To distinguish pre_word_pause from pre_phrase_pause and pre_turn_pause as defined later in this document, pre_word_pause for the first word in each phrase and turn is also set to NaN.

  • Number of syllables, identified using NLTK’s SyllableTokenizer (num_syllables)

  • Part of speech associated with the word, identified using NLTK, as specified in the part_of_speech column:

    • noun
    • verb
    • adjective
    • pronoun
  • Emotional valence associated with each word, calculated using the vaderSentiment library:

    • Degree of positive valence ranging from 0-1 (sentiment_pos)
    • Degree of negative valence ranging from 0-1 (sentiment_neg)
    • Degree of neutral valence ranging from 0-1 (sentiment_neu)
    • Degree of overall valence ranging from 0-1 (sentiment_overall)
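
The pre_word_pause rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the word layout (dicts with "start" and "end" in seconds) and the phrase_starts argument are assumptions made for the example.

```python
import math


def pre_word_pauses(words, phrase_starts=()):
    """Return the pause before each word in seconds.

    words: list of dicts with 'start' and 'end' times (assumed layout).
    phrase_starts: indices of words that open a phrase or turn (hypothetical
    argument); their pause is NaN, as is the pause of the first word in the file.
    """
    pauses = []
    for index, word in enumerate(words):
        if index == 0 or index in phrase_starts:
            pauses.append(math.nan)
        else:
            # Gap between the end of the previous word and the start of this one.
            pauses.append(word["start"] - words[index - 1]["end"])
    return pauses


words = [
    {"word": "hello", "start": 0.50, "end": 0.90},
    {"word": "there", "start": 1.00, "end": 1.30},
    {"word": "how", "start": 2.00, "end": 2.20},
    {"word": "are", "start": 2.25, "end": 2.40},
]
pauses = pre_word_pauses(words, phrase_starts={2})
```

Here the first word and the phrase-initial word "how" receive NaN, while "there" and "are" receive pauses of roughly 0.1 and 0.05 seconds.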

2.2 - Per-phrase measures

If the input JSON file also identifies phrases (as it does when the user ran the speech_transcription_cloud function with Amazon Transcribe), the function populates phrase-level characteristics in a phrases dataframe. This includes:

  • Pause time before the phrase in seconds (pre_phrase_pause)

    The pre_phrase_pause for phrases at the beginning of a file and at the beginning of a turn is set to NaN, to remove the effect of long silences at the beginning of audio files and to distinguish this variable from pre_turn_pause, respectively.

  • Total length of the phrase in minutes (phrase_length_minutes)

  • Total number of words in the phrase (phrase_length_words)

  • Rate of speech in words per minute (words_per_min)

  • Articulation rate, i.e., syllables per minute (syllables_per_min)

  • Pause rate in pauses per minute (pauses_per_min)

  • Pause variability in seconds (pause_variability)

  • Mean pause time between words in seconds (mean_pause_length)

  • Speech percentage i.e., time spoken over total time (speech_percentage)

  • Parts of speech used in the entire phrase, with a column each for:

    • noun_percentage
    • verb_percentage
    • adjective_percentage
    • pronoun_percentage
  • Emotional valence associated with the phrase:

    • Degree of positive valence ranging from 0-1 (sentiment_pos)
    • Degree of negative valence ranging from 0-1 (sentiment_neg)
    • Degree of neutral valence ranging from 0-1 (sentiment_neu)
    • Degree of overall valence ranging from 0-1 (sentiment_overall)
  • Lexical diversity as measured by the moving average type token ratio (MATTR) score, calculated using the LexicalRichness library (mattr)
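
To make the rate measures in this list concrete, here is a minimal sketch that derives several of them from word timestamps. The formulas are assumptions based on the descriptions above (for example, speech_percentage is taken as time spent speaking divided by phrase duration, on a 0-100 scale); the library's exact definitions may differ.

```python
from statistics import mean


def phrase_measures(words):
    """Compute rate measures for one phrase.

    words: list of dicts with 'start' and 'end' in seconds (assumed layout).
    The phrase must have a non-zero duration.
    """
    length_seconds = words[-1]["end"] - words[0]["start"]
    length_minutes = length_seconds / 60
    # Gaps between consecutive words inside the phrase.
    pauses = [nxt["start"] - cur["end"] for cur, nxt in zip(words, words[1:])]
    spoken_seconds = sum(w["end"] - w["start"] for w in words)
    return {
        "phrase_length_minutes": length_minutes,
        "phrase_length_words": len(words),
        "words_per_min": len(words) / length_minutes,
        "pauses_per_min": len(pauses) / length_minutes,
        "mean_pause_length": mean(pauses) if pauses else float("nan"),
        "speech_percentage": 100 * spoken_seconds / length_seconds,
    }


phrase = [
    {"word": "good", "start": 0.0, "end": 0.4},
    {"word": "morning", "start": 0.6, "end": 1.2},
    {"word": "everyone", "start": 1.5, "end": 3.0},
]
result = phrase_measures(phrase)
```

For this three-second phrase the sketch gives 60 words per minute, 40 pauses per minute, and a mean pause of about 0.25 seconds.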

2.3 – Per-turn measures

A turn is a single speech segment by a speaker.

For a single-speaker file, one turn covers the entire speech, so this output is populated with NaNs and can be disregarded.

When there are multiple speakers marked in the JSON, a turn consists of everything a speaker says before the next one starts. Therefore, a one-speaker JSON has a single turn, while a multi-speaker JSON has multiple, alternating turns. Accordingly, this function provides a turns dataframe, detailing turn-level speech and language characteristics. This includes:

  • Pause time before the turn in seconds (pre_turn_pause)

    This is set to NaN for a turn at the beginning of the audio file, to remove the effect of a potential long silence before speech starts.

  • Total length of the turn in minutes (turn_length_minutes)

  • Total number of words in the turn (turn_length_words)

  • Rate of speech in words per minute (words_per_min)

  • Articulation rate, i.e., syllables per minute (syllables_per_min)

  • Pause rate in pauses per minute (pauses_per_min)

  • Pause variability in seconds (pause_variability)

  • Mean pause time between words in seconds (mean_pause_length)

  • Speech percentage i.e., time spoken over total time (speech_percentage)

  • Parts of speech used in the entire turn, with a column each for:

    • noun_percentage
    • verb_percentage
    • adjective_percentage
    • pronoun_percentage
  • Emotional valence associated with the turn:

    • Degree of positive valence ranging from 0-1 (sentiment_pos)
    • Degree of negative valence ranging from 0-1 (sentiment_neg)
    • Degree of neutral valence ranging from 0-1 (sentiment_neu)
    • Degree of overall valence ranging from 0-1 (sentiment_overall)
  • Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)

  • Flag for negative pre-turn pause, i.e., an interrupt (interrupt_flag)

    Set to True if the pre-turn pause is negative; otherwise, set to False.
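
The turn definition and the interrupt flag can be sketched as below. This is an illustration under stated assumptions, not the library's code: each word is assumed to carry a "speaker" key next to its timestamps, words are assumed to be in transcript order, and the helper name is hypothetical.

```python
import math
from itertools import groupby


def build_turns(words, speaker_label):
    """Group consecutive words by speaker and describe the selected speaker's turns."""
    turns = []
    previous_end = None  # end time of the preceding turn, by any speaker
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        group = list(group)
        if speaker == speaker_label:
            # The first turn in the file has no preceding turn, hence NaN.
            pause = math.nan if previous_end is None else group[0]["start"] - previous_end
            turns.append({
                "pre_turn_pause": pause,
                "turn_length_words": len(group),
                "turn_length_minutes": (group[-1]["end"] - group[0]["start"]) / 60,
                # A NaN pause compares as False, so it is never flagged.
                "interrupt_flag": pause < 0,
            })
        previous_end = group[-1]["end"]
    return turns


words = [
    {"speaker": "A", "word": "hi", "start": 0.0, "end": 0.5},
    {"speaker": "A", "word": "there", "start": 0.6, "end": 1.0},
    {"speaker": "B", "word": "hello", "start": 1.4, "end": 1.8},
    {"speaker": "A", "word": "so", "start": 1.7, "end": 2.0},
    {"speaker": "A", "word": "anyway", "start": 2.1, "end": 2.5},
    {"speaker": "B", "word": "yes", "start": 3.0, "end": 3.4},
]
turns_a = build_turns(words, "A")
turns_b = build_turns(words, "B")
```

Speaker A's second turn starts before speaker B has finished, so its pre_turn_pause is negative and its interrupt_flag is True.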

2.4 – Summary

Finally, the function outputs a summary dataframe, which compiles file-level information.

  • Total length of speech in minutes (speech_length_minutes)
  • Total number of words spoken (speech_length_words)
  • Rate of speech in words per minute (words_per_min)
  • Articulation rate, i.e., syllables per minute (syllables_per_min)
  • Pause rate in pauses per minute (pauses_per_min)
  • Pre-word pause mean in seconds (word_pause_length_mean)
  • Pre-word pause variability in seconds (word_pause_variability)
  • Pre-phrase pause mean in seconds (phrase_pause_length_mean)
  • Pre-phrase pause variability in seconds (phrase_pause_variability)
  • Speech percentage i.e., time spoken over total time (speech_percentage)
  • Parts of speech used in the entire file, with a column each for:
    • noun_percentage
    • verb_percentage
    • adjective_percentage
    • pronoun_percentage
  • Emotional valence associated with the speech:
    • Degree of positive valence ranging from 0-1 (sentiment_pos)
    • Degree of negative valence ranging from 0-1 (sentiment_neg)
    • Degree of neutral valence ranging from 0-1 (sentiment_neu)
    • Degree of overall valence ranging from 0-1 (sentiment_overall)
  • Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)
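
For readers unfamiliar with MATTR, the measure can be sketched in plain Python as the mean type-token ratio over all sliding windows of a fixed size. This is an illustrative re-implementation, not the LexicalRichness code the function relies on, and the window size used here is an arbitrary choice for the example.

```python
def mattr(tokens, window_size):
    """Moving average type-token ratio over every window of `window_size` tokens."""
    if window_size <= 0 or window_size > len(tokens):
        raise ValueError("window_size must be between 1 and the number of tokens")
    ratios = [
        # Unique tokens in the window divided by the window length.
        len(set(tokens[start:start + window_size])) / window_size
        for start in range(len(tokens) - window_size + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = "the cat sat on the mat the end".split()
score = mattr(tokens, window_size=4)
```

Because every window has the same length, the score is less sensitive to transcript length than a plain type-token ratio.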

If the source JSON contains more than one speaker, each speaker is assumed to have more than one turn, so the following turn summary statistics are also included (they are set to NaN otherwise):

  • Total number of turns (num_turns)
  • Mean length of turns in minutes (mean_turn_length_minutes)
  • Mean length of turns in words spoken (mean_turn_length_words)
  • Mean pause time before each turn (mean_pre_turn_pause)
  • Number of one-word turns (num_one_word_turns)
  • Number of interrupts, i.e. negative pre-turn pauses (num_interrupts)

    This is the sum of the interrupt_flag values in the turns dataframe.
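
The turn summary statistics above can be sketched as simple aggregations over per-turn records. The record layout mirrors the turns columns described in section 2.3; the helper name and the sample values are hypothetical, and excluding NaN pauses from the mean is an assumption.

```python
import math
from statistics import mean


def summarize_turns(turns):
    """Aggregate per-turn records into file-level turn statistics."""
    # Assumption: NaN pauses (e.g., the first turn in the file) are left out of the mean.
    pauses = [t["pre_turn_pause"] for t in turns if not math.isnan(t["pre_turn_pause"])]
    return {
        "num_turns": len(turns),
        "mean_turn_length_minutes": mean([t["turn_length_minutes"] for t in turns]),
        "mean_turn_length_words": mean([t["turn_length_words"] for t in turns]),
        "mean_pre_turn_pause": mean(pauses) if pauses else math.nan,
        "num_one_word_turns": sum(t["turn_length_words"] == 1 for t in turns),
        # Summing booleans counts the turns flagged as interrupts.
        "num_interrupts": sum(t["interrupt_flag"] for t in turns),
    }


turns = [
    {"pre_turn_pause": math.nan, "turn_length_minutes": 0.5, "turn_length_words": 60, "interrupt_flag": False},
    {"pre_turn_pause": -0.2, "turn_length_minutes": 0.01, "turn_length_words": 1, "interrupt_flag": True},
    {"pre_turn_pause": 0.8, "turn_length_minutes": 0.24, "turn_length_words": 29, "interrupt_flag": False},
]
turn_summary = summarize_turns(turns)
```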

3 – Inputs

3.1 – json_conf

Type json
Description output from speech transcription function

3.2 – language

Type str, optional, default = 'en'
Description the language for which speech characteristics will be calculated. If the language is English, all variables shown are calculated. If the language is not English, only language-independent variables are calculated (e.g., pause characteristics).

3.3 – speaker_label

Type str, optional, default = None
Description the speaker label from the JSON file for which the speech characteristics are calculated

4 – Outputs

4.1 – words

Type pandas.DataFrame
Description summary of speech characteristics from inputted json file (word-level)

What a transpose of the data frame looks like:

pre_word_pause ← duration of pre-word pause in seconds
num_syllables ← number of syllables
part_of_speech ← whether the word is noun, verb, adjective or pronoun
sentiment_neg ← negative sentiment of speech
sentiment_neu ← neutral sentiment of speech
sentiment_pos ← positive sentiment of speech
sentiment_overall ← overall sentiment of speech

4.2 – phrases

Type pandas.DataFrame
Description summary of speech characteristics from inputted json file (phrase-level)

What a transpose of the data frame looks like:

pre_phrase_pause ← duration of pre-phrase pause in seconds
phrase_length_minutes ← total time of phrase
phrase_length_words ← total words spoken
words_per_min ← rate of speech in words per minute
syllables_per_min ← articulation rate i.e. syllables per minute
pauses_per_min ← pause rate i.e. number of pauses per minute
pause_variability ← pause variability in seconds
mean_pause_length ← mean pause time between words in seconds
speech_percentage ← speech percentage i.e., time spoken over total time
noun_percentage ← noun percentage
verb_percentage ← verb percentage
adj_percentage ← adjective percentage
pronoun_percentage ← pronoun percentage
sentiment_neg ← negative sentiment of speech
sentiment_neu ← neutral sentiment of speech
sentiment_pos ← positive sentiment of speech
sentiment_overall ← overall sentiment of speech
mattr ← moving average type token ratio

4.3 – turns

Type pandas.DataFrame or None
Description summary of speech characteristics from inputted json file (turn-level), only populated in case of multiple speakers

What a transpose of the data frame looks like:

pre_turn_pause ← duration of pre-turn pause in seconds
turn_length_minutes ← total time of turn in minutes
turn_length_words ← total words spoken
words_per_min ← rate of speech in words per minute
syllables_per_min ← articulation rate i.e. syllables per minute
pauses_per_min ← pause rate i.e. number of pauses per minute
pause_variability ← pause variability in seconds
mean_pause_length ← mean pause time between words in seconds
speech_percentage ← speech percentage i.e., time spoken over total time
noun_percentage ← noun percentage
verb_percentage ← verb percentage
adj_percentage ← adjective percentage
pronoun_percentage ← pronoun percentage
sentiment_pos ← positive sentiment of speech
sentiment_neg ← negative sentiment of speech
sentiment_neu ← neutral sentiment of speech
sentiment_overall ← overall sentiment of speech
mattr ← moving average type token ratio
interrupt_flag ← flag for negative pre-turn pause, i.e. interrupt

4.4 – summary

Type pandas.DataFrame
Description summary of speech characteristics from inputted json file (file-level)

What a transpose of the data frame looks like:

speech_length_minutes ← total length of speech in minutes
speech_length_words ← total words spoken
words_per_min ← rate of speech in words per minute
syllables_per_min ← articulation rate i.e. syllables per minute
pauses_per_min ← pauses per minute
word_pause_length_mean ← mean duration of pre-word pauses in seconds
word_pause_variability ← variability of pre-word pauses in seconds
phrase_pause_length_mean ← mean duration of pre-phrase pauses in seconds
phrase_pause_variability ← variability of pre-phrase pauses in seconds
speech_percentage ← speech percentage i.e., time spoken over total time
noun_percentage ← noun percentage
verb_percentage ← verb percentage
adj_percentage ← adjective percentage
pronoun_percentage ← pronoun percentage
sentiment_pos ← positive sentiment of speech
sentiment_neg ← negative sentiment of speech
sentiment_neu ← neutral sentiment of speech
sentiment_overall ← overall sentiment of speech
mattr ← moving average type token ratio
num_turns ← total number of turns
mean_turn_length_minutes ← mean length of turns in minutes
mean_turn_length_words ← mean length of turns in words spoken
turn_pause_length_mean ← mean pause time before each turn
num_one_word_turns ← number of one-word turns
num_interrupts ← number of interrupts, i.e. negative pre-turn pauses

5 – Dependencies

Below are dependencies specific to calculation of this measure.

| Dependency | License | Justification |
| --- | --- | --- |
| NLTK | Apache 2.0 | Well-established library for commonly measured natural language characteristics |
| LexicalRichness | MIT | Straightforward implementation of methods for calculation of MATTR score |
| vaderSentiment | MIT | Widely used library for sentiment analysis, trained on a large and heterogeneous dataset |