Vocal acoustics v2.0
| Date completed | Release where first appeared | Researcher / Developer |
| --- | --- | --- |
| March 19, 2024 | OpenWillis v2.1 | Vijay Yadav, Georgios Efstathiadis |
```python
import openwillis as ow

framewise, summary = ow.vocal_acoustics(audio_path = 'audio.wav', option = 'simple')
```
Calculates a set of vocal acoustic features from the input audio (only .wav and .mp3 files are supported).
- First, a set of vocal acoustic properties with framewise values is calculated through Parselmouth and saved in `framewise`. This includes the following variables:
- Fundamental frequency (f0), measured in Hertz
- Formant frequencies 1 through 4 (f1, f2, f3, and f4), measured in Hertz
- Loudness, measured in decibels
- Harmonics-to-noise ratio (hnr)
- In the summary output, the mean, standard deviation, and range of each of the variables from the first step are saved.
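As a rough illustration (not OpenWillis's implementation), the summary statistics for one framewise variable could be derived like this, using made-up f0 values:

```python
import statistics

# Hypothetical framewise f0 values in Hz, standing in for one
# column of the framewise output.
f0 = [107.7, 105.9, 110.2, 108.4, 109.1]

f0_mean = statistics.mean(f0)    # mean of the framewise values
f0_std = statistics.stdev(f0)    # sample standard deviation
f0_range = max(f0) - min(f0)     # range (max minus min)
```

The same three statistics would be computed for each of the framewise variables listed above.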
- The pause information is compiled into three variables, also saved in summary:
- Number of pauses per minute (`pause_rate`)
- Mean duration of pauses (`pause_meandur`), measured in seconds
- Silence ratio (`silence_ratio`), the percentage of frames with no voice detected
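Given a list of detected silence intervals, the three pause variables follow directly from their definitions. A minimal sketch with toy values (the interval list, total duration, and detection step are assumptions, not OpenWillis internals):

```python
# Hypothetical silence intervals as (start, end) pairs in seconds,
# e.g. as produced by a silence detector such as pydub's.
silences = [(1.2, 1.8), (4.0, 4.5), (7.3, 8.1)]
total_duration = 10.0  # assumed total audio length in seconds

pause_durations = [end - start for start, end in silences]

pause_rate = len(silences) / (total_duration / 60)            # pauses per minute
pause_meandur = sum(pause_durations) / len(pause_durations)   # mean pause duration, s
silence_ratio = sum(pause_durations) / total_duration * 100   # % of audio with no voice
```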
- Parselmouth is used to calculate an additional set of variables that pertain to the entire audio file rather than individual frames. These are saved directly in the summary output:
- Jitter (absolute)
- Jitter (rap)
- Jitter (ppq5)
- Jitter (ddp)
- Shimmer (absolute)
- Shimmer (db)
- Shimmer (apq3)
- Shimmer (apq5)
- Shimmer (apq11)
- Shimmer (dda)
- Glottal-to-noise excitation ratio
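Parselmouth obtains these measures through Praat's voice analysis. For intuition, the simplified textbook definitions of absolute jitter, local jitter, and local shimmer can be sketched as follows (toy period and amplitude values; this is not the exact Praat algorithm, which adds period-validity constraints):

```python
# Hypothetical consecutive glottal periods (s) and peak amplitudes.
periods = [0.0100, 0.0103, 0.0098, 0.0101, 0.0099]
amps = [0.82, 0.80, 0.85, 0.81, 0.83]

def mean_abs_diff(xs):
    """Mean absolute difference between consecutive values."""
    return sum(abs(a - b) for a, b in zip(xs[1:], xs)) / (len(xs) - 1)

jitter_absolute = mean_abs_diff(periods)                         # in seconds
jitter_local = jitter_absolute / (sum(periods) / len(periods))   # relative to mean period
shimmer_local = mean_abs_diff(amps) / (sum(amps) / len(amps))    # relative to mean amplitude
```

The rap/ppq5/ddp jitter variants and apq3/apq5/apq11/dda shimmer variants average over longer neighborhoods of periods in the same spirit.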
- Additionally, the following measures derived from the outputs above are saved in the summary output (Kovac et al., 2023):
- Pitch variation (relF0SD), defined as the standard deviation of the F0 contour of voiced segments longer than 100 ms, relative to its mean
- Speech loudness variation (relSE0SD), defined as the standard deviation of the energy of voiced segments longer than 100 ms, relative to its mean
- SPIR, the number of pauses (longer than 50 ms and shorter than 2 s) relative to total speech time
- DurMED, the median duration of silences longer than 50 ms and shorter than 2 s
- DurMAD, the median absolute deviation of the durations of silences longer than 50 ms and shorter than 2 s
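The Kovac et al. (2023) definitions above can be sketched on toy data as follows (the segment and pause lists and the total speech time are made up for illustration):

```python
import statistics

# Hypothetical voiced segments as (duration_s, framewise f0 values)
# and hypothetical pause durations in seconds.
voiced = [(0.25, [110.0, 112.0, 108.0]),
          (0.05, [90.0]),                 # too short: excluded (< 100 ms)
          (0.30, [115.0, 111.0, 113.0])]
pauses = [0.08, 1.5, 0.03, 2.5]
speech_time = 12.0  # assumed total speech time in seconds

# relF0SD: SD of the F0 contour of voiced segments > 100 ms, relative to its mean
f0 = [v for dur, vals in voiced if dur > 0.100 for v in vals]
relF0SD = statistics.stdev(f0) / statistics.mean(f0)

# SPIR: pauses between 50 ms and 2 s, relative to total speech time
spir = sum(1 for p in pauses if 0.050 < p < 2.0) / speech_time
```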
- The mean and variance of cepstral features are also calculated and saved in the summary output (Silva et al., 2021; Jiang et al., 2017; Dumpala et al., 2021; Berardi et al., 2023):
- The first 14 Mel-Frequency Cepstral Coefficients (MFCCs), a low-dimensional representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency
- Cepstral Peak Prominence (CPP), a measure of breathiness and overall dysphonia
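The nonlinear mel scale underlying the MFCCs compresses frequency resolution at high frequencies to mimic human pitch perception. A common (HTK-style) formulation of the mapping is shown below; note that librosa's default mel filterbank uses the slightly different Slaney variant:

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel scale: approximately linear below ~1 kHz,
    # logarithmic above it.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10 ** (m / 2595.0) - 1.0)
```

In MFCC extraction, triangular filters spaced evenly on this mel axis are applied to the power spectrum before the log and cosine transform.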
- Vocal tremor statistics, calculated from tremor.praat, are also saved in the summary output (these are meaningful only when the input is a sustained vowel phonation). They include the following:
- Frequency contour magnitude (FCoM)
- (Maximum) frequency tremor cyclicality (FTrC)
- Number of frequency modulations above thresholds (FMoN)
- (Strongest) frequency tremor frequency (FTrF)
- Frequency tremor intensity index (FTrI) at FTrF
- Frequency tremor power index (FTrP) at FTrF
- Frequency tremor cyclicality intensity product (FTrCIP) at FTrF
- Frequency tremor product sum (FTrPS)
- Frequency contour harmonicity-to-noise ratio (FCoHNR)
- Amplitude contour magnitude (ACoM)
- (Maximum) amplitude tremor cyclicality (ATrC)
- Number of amplitude modulations above thresholds (AMoN)
- (Strongest) amplitude tremor frequency (ATrF)
- Amplitude tremor intensity index (ATrI)
- Amplitude tremor power index (ATrP)
- Amplitude tremor cyclicality intensity product (ATrCIP)
- Amplitude tremor product sum (ATrPS)
- Amplitude contour harmonicity-to-noise ratio (ACoHNR)
- Glottal features are also calculated and saved in the summary output (note that they take a long time to compute). These include:
- Average Harmonic Richness Factor (HRF), the ratio of the sum of harmonic amplitudes to the amplitude of the fundamental frequency
- Variability of the Harmonic Richness Factor (HRF)
- Average Normalized Amplitude Quotient (NAQ) for consecutive glottal cycles, the ratio of the amplitude quotient to the duration of the glottal cycle
- Variability of the Normalized Amplitude Quotient (NAQ) for consecutive glottal cycles
- Average Opening Quotient (OQ) for consecutive glottal cycles, the ratio of opening-phase duration to glottal cycle duration
- Variability of the Opening Quotient (OQ) for consecutive glottal cycles
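The per-cycle ratios behind NAQ and OQ are simple to state. A minimal sketch on one hypothetical glottal cycle (all values are made up; the actual glottal-flow estimation is done by DisVoice):

```python
# Hypothetical values for a single glottal cycle.
T0 = 0.010       # glottal cycle duration in seconds
t_open = 0.006   # duration of the opening phase in seconds
aq = 0.0004      # amplitude quotient for the cycle, in seconds

naq = aq / T0      # Normalized Amplitude Quotient
oq = t_open / T0   # Opening Quotient
```

The summary output reports the mean and variability of these ratios over all consecutive glottal cycles.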
`audio_path`

| Type | Description |
| --- | --- |
| str | Path to the audio file; only .wav and .mp3 files are supported |

`option`

| Type | Description |
| --- | --- |
| str | Determines which measures are calculated; can be 'simple', 'advanced', or 'tremor'. Default is 'simple' |
| Option | List of variables calculated |
| --- | --- |
| simple | Parselmouth measures, pause measures, cepstral measures |
| tremor | Simple measures + tremor measures |
| advanced | Simple measures + tremor measures + glottal measures |
`framewise`

| Type | Description |
| --- | --- |
| pd.DataFrame | Framewise output of acoustic properties that can be calculated for individual frames; columns represent variables, rows represent frames |
What the data frame looks like:
| frame | f0 | f1 | f2 | f3 | f4 | loudness | hnr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | | | | | | |
| 1 | | | | | | | |
| ... | | | | | | | |
`summary`

| Type | Description |
| --- | --- |
| pd.DataFrame | Final output of all vocal acoustic measures calculated from the input audio file |
Here, we use this function to process a sample audio file included in the repository.
```python
import openwillis as ow

framewise, summary = ow.vocal_acoustics(audio_path = 'data/trim.wav')
framewise.head(2)
```
| frame | f0 | loudness | hnr | form1freq | form2freq | form3freq | form4freq |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 107.72 | 49.71 | 7.88 | 439.77 | 1720.29 | 2662.75 | 4328.91 |
| 1 | 105.88 | 48.59 | 9.10 | 376.80 | 2513.84 | 2667.70 | 4105.55 |
Below are dependencies specific to calculation of this measure.
| Dependency | License | Justification |
| --- | --- | --- |
| Parselmouth | GPL 3.0 License | Python implementation of the Praat software library, a long-trusted source of measurement methods in vocal acoustics |
| Pydub | MIT License | Open-source and accurate methods for analysis of audio files; used to parse speech versus silence in audio files |
| DisVoice | MIT License | Only the glottal module is used, for calculation of advanced glottal features (HRF, NAQ, QOQ); it is a Python implementation of the widely used MATLAB COVAREP project |
| pysptk | MIT License | A Python wrapper for the Speech Signal Processing Toolkit (SPTK), used in DisVoice feature calculations |
| librosa | ISC License | A package for music and audio analysis, used in cepstral variable calculation, specifically extraction of MFCCs |
Berardi, M. L., Brosch, K., Pfarr, J., Schneider, K., Sültmann, A., Thomas-Odenthal, F., Wroblewski, A., Usemann, P., Philipsen, A., Dannlowski, U., Nenadić, I., Kircher, T., Krug, A., Stein, F., & Dietrich, M. (2023). Relative importance of speech and voice features in the classification of schizophrenia and depression. Translational Psychiatry, 13(1). https://doi.org/10.1038/s41398-023-02594-0
Dumpala, S. H., Rempel, S., Dikaios, K., Sajjadian, M., Uher, R., & Oore, S. (2021). Estimating Severity of Depression From Acoustic Features and Embeddings of Natural Speech. IEEE ICASSP 2021. https://doi.org/10.1109/icassp39728.2021.9414129
Jiang, H., Hu, B., Liu, Z., Yan, L., Wang, T., Liu, F., Kang, H., & Li, X. (2017). Investigation of different speech types and emotions for detecting depression using different classifiers. Speech Communication, 90, 39–46. https://doi.org/10.1016/j.specom.2017.04.001
Kovac, D., Mekyska, J., Brabenec, L., Košťálová, M., & Rektorová, I. (2023). Research on passive assessment of Parkinson’s Disease utilising speech biomarkers. In Pervasive Computing Technologies for Healthcare (pp. 259–273). https://doi.org/10.1007/978-3-031-34586-9_18
Silva, W. J., Lopes, L. W., Galdino, M. K. C., & Almeida, A. A. (2021). Voice Acoustic Parameters as Predictors of Depression. Journal of Voice. https://doi.org/10.1016/j.jvoice.2021.06.018
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.