
Vocal acoustics v2.0

| | |
| --- | --- |
| Date completed | March 19, 2024 |
| Release where first appeared | OpenWillis v2.1 |
| Researcher / Developer | Vijay Yadav, Georgios Efstathiadis |

1 – Use

```python
import openwillis as ow

framewise, summary = ow.vocal_acoustics(audio_path = 'audio.wav', option = 'simple')
```

2 – Methods

Calculates a list of vocal acoustic features from the input audio (only .wav and .mp3 files are supported).

  • First, a set of vocal acoustic properties with framewise values is calculated through Parselmouth and saved in framewise; these include fundamental frequency (f0), loudness, harmonics-to-noise ratio (hnr), and the first four formant frequencies.
  • In the summary output, the mean, standard deviation, and range of each of the variables from the first step are saved.
  • The pause information is compiled into three variables, also saved in summary:
    • Number of pauses per minute (pause_rate)
    • Mean duration of pauses (pause_meandur), measured in seconds
    • Silence ratio (silence_ratio), the percentage of frames with no voice detected
  • Parselmouth is also used to calculate an additional set of variables that pertain to the entire audio file rather than to individual frames; these are saved directly in the summary output.
  • Additionally, the following measures derived from the previous outputs are saved in the summary output (Kovac et al., 2023):
    • Pitch variation (relF0SD), the standard deviation of the F0 contour of voiced segments longer than 100 ms, relative to its mean
    • Speech loudness variation (relSE0SD), the standard deviation of the energy of voiced segments longer than 100 ms, relative to its mean
    • SPIR, the number of pauses (longer than 50 ms and shorter than 2 s) relative to total speech time
    • DurMED, the median duration of silences longer than 50 ms and shorter than 2 s
    • DurMAD, the median absolute deviation of the duration of silences longer than 50 ms and shorter than 2 s
  • The mean and variance of cepstral features (MFCCs, extracted with librosa) are also calculated and saved in the summary output (Silva et al., 2021; Jiang et al., 2017; Dumpala et al., 2021; Berardi et al., 2023).
  • Vocal tremor statistics, calculated with tremor.praat, are also saved in the summary output (these are only meaningful when the input is a sustained vowel phonation). They include the following:
    • frequency contour magnitude (FCoM)
    • (maximum) frequency tremor cyclicality (FTrC)
    • number of frequency modulations above thresholds (FMoN)
    • (strongest) frequency tremor frequency (FTrF)
    • frequency tremor intensity index (FTrI) at FTrF
    • frequency tremor power index (FTrP) at FTrF
    • frequency tremor cyclicality intensity product (FTrCIP) at FTrF
    • frequency tremor product sum (FTrPS)
    • frequency contour harmonicity-to-noise ratio (FCoHNR)
    • amplitude contour magnitude (ACoM)
    • (maximum) amplitude tremor cyclicality (ATrC)
    • number of amplitude modulations above thresholds (AMoN)
    • (strongest) amplitude tremor frequency (ATrF)
    • amplitude tremor intensity index (ATrI)
    • amplitude tremor power index (ATrP)
    • amplitude tremor cyclicality intensity product (ATrCIP)
    • amplitude tremor product sum (ATrPS)
    • amplitude contour harmonicity-to-noise ratio (ACoHNR)
  • Glottal features are also calculated and saved in the summary output (note that they take a long time to compute); these include advanced glottal measures such as HRF, NAQ, and QOQ.
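The pause measures described above (pause_rate, pause_meandur, silence_ratio) can be illustrated with a small sketch. This is not the OpenWillis implementation; it is a minimal version that assumes a per-frame voiced/unvoiced flag, a hypothetical 10 ms frame step, and a hypothetical 50 ms minimum pause duration:

```python
def pause_measures(voiced, frame_s=0.01, min_pause_s=0.05):
    """Illustrative pause statistics from a per-frame voiced flag.

    voiced      : list of booleans, one per analysis frame
    frame_s     : assumed frame step in seconds (hypothetical value)
    min_pause_s : minimum silence run counted as a pause (hypothetical value)
    """
    total_s = len(voiced) * frame_s
    pauses, run = [], 0
    for v in list(voiced) + [True]:        # trailing True flushes a final run
        if not v:
            run += 1
        else:
            if run * frame_s >= min_pause_s:
                pauses.append(run * frame_s)
            run = 0
    minutes = total_s / 60
    return {
        # number of pauses per minute
        "pause_rate": len(pauses) / minutes if minutes else 0.0,
        # mean pause duration in seconds
        "pause_meandur": sum(pauses) / len(pauses) if pauses else 0.0,
        # percentage of frames with no voice detected
        "silence_ratio": 100 * (1 - sum(voiced) / len(voiced)),
    }
```

For a one-second clip with a single 100 ms silence in the middle, this yields one pause, hence a pause_rate of 60 per minute, a pause_meandur of 0.1 s, and a silence_ratio of 10%.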

3 – Inputs

3.1 – audio_path

Type str
Description path to the audio file; only .wav and .mp3 files are supported

3.2 – option

Type str
Description Default is ‘simple’; determines which measures are calculated; can be ‘simple’, ‘tremor’, or ‘advanced’
| Option | Variables calculated |
| --- | --- |
| simple | Parselmouth measures, pause measures, cepstral measures |
| tremor | Simple measures + tremor measures |
| advanced | Simple measures + tremor measures + glottal measures |
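Conceptually, the option flag just gates which feature groups are computed. A hedged sketch of that dispatch (the group names are illustrative, not OpenWillis internals):

```python
def feature_groups(option="simple"):
    # Illustrative only: maps the option flag to the feature groups
    # described in the table above.
    groups = ["parselmouth", "pause", "cepstral"]   # always computed
    if option in ("tremor", "advanced"):
        groups.append("tremor")
    if option == "advanced":
        groups.append("glottal")                    # slow to compute
    return groups
```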

4 – Outputs

4.1 – framewise

Type pandas dataframe
Description framewise output of acoustic properties that can be calculated for individual frames; columns represent variables, rows represent frames

What the data frame looks like:

| frame | f0 | loudness | hnr | form1freq | form2freq | form3freq | form4freq |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | | | | | | |
| 1 | | | | | | | |
| ... | | | | | | | |

4.2 – summary

Type pandas dataframe
Description final output of all vocal acoustic measures calculated from the input audio file
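The mean / standard deviation / range aggregation that populates summary from the framewise variables can be sketched with the standard library. The variable names and the `_mean` / `_std` / `_range` column suffixes here are assumptions for illustration, not the exact OpenWillis schema:

```python
from statistics import mean, stdev

# Hypothetical framewise values for two variables (illustrative only).
framewise = {
    "f0":       [107.72, 105.88, 110.10],
    "loudness": [49.71, 48.59, 50.00],
}

summary = {}
for var, vals in framewise.items():
    summary[f"{var}_mean"] = mean(vals)
    summary[f"{var}_std"] = stdev(vals)              # sample standard deviation
    summary[f"{var}_range"] = max(vals) - min(vals)  # max minus min
```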

5 – Example use

Here, we use this function to process a sample audio file included in the repository.

```python
import openwillis as ow

framewise, summary = ow.vocal_acoustics(audio_path = 'data/trim.wav')
framewise.head(2)
```

| frame | f0 | loudness | hnr | form1freq | form2freq | form3freq | form4freq |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 107.72 | 49.71 | 7.88 | 439.77 | 1720.29 | 2662.75 | 4328.91 |
| 1 | 105.88 | 48.59 | 9.10 | 376.80 | 2513.84 | 2667.70 | 4105.55 |

6 – Dependencies

Below are dependencies specific to calculation of this measure.

| Dependency | License | Justification |
| --- | --- | --- |
| Parselmouth | GPL 3.0 License | Python implementation of the Praat software library, a long-trusted source of measurement methods in vocal acoustics |
| Pydub | MIT License | Open-source and accurate methods for analysis of audio files; used to parse speech versus silence in audio files |
| DisVoice | MIT License | Only the glottal module is used, for calculation of advanced glottal features (HRF, NAQ, QOQ); it is a Python implementation of the widely used MATLAB COVAREP project |
| pysptk | MIT License | A Python wrapper for the Speech Signal Processing Toolkit (SPTK), used in DisVoice feature calculations |
| librosa | ISC License | A package for music and audio analysis, used in cepstral variable calculation, specifically extraction of MFCCs |

7 – References

Berardi, M. L., Brosch, K., Pfarr, J., Schneider, K., Sültmann, A., Thomas-Odenthal, F., Wroblewski, A., Usemann, P., Philipsen, A., Dannlowski, U., Nenadić, I., Kircher, T., Krug, A., Stein, F., & Dietrich, M. (2023). Relative importance of speech and voice features in the classification of schizophrenia and depression. Translational Psychiatry, 13(1). https://doi.org/10.1038/s41398-023-02594-0

Dumpala, S. H., Rempel, S., Dikaios, K., Sajjadian, M., Uher, R., & Oore, S. (2021). Estimating Severity of Depression From Acoustic Features and Embeddings of Natural Speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp39728.2021.9414129

Jiang, H., Hu, B., Liu, Z., Yan, L., Wang, T., Liu, F., Kang, H., & Li, X. (2017). Investigation of different speech types and emotions for detecting depression using different classifiers. Speech Communication, 90, 39–46. https://doi.org/10.1016/j.specom.2017.04.001

Kovac, D., Mekyska, J., Brabenec, L., Košťálová, M., & Rektorová, I. (2023). Research on passive assessment of Parkinson’s Disease utilising speech biomarkers. In Pervasive Computing Technologies for Healthcare (pp. 259–273). https://doi.org/10.1007/978-3-031-34586-9_18

Silva, W. J., Lopes, L. W., Galdino, M. K. C., & Almeida, A. A. (2021). Voice Acoustic Parameters as Predictors of Depression. Journal of Voice. https://doi.org/10.1016/j.jvoice.2021.06.018
