# Speech Feature Engineering




In [1]:
# One file to test on
# wav_path = './data/wav_files/'
wav_path = './data/teacher_wav_files/'
file_name = '228_3.4.20_S_SC'
wav_file = wav_path + file_name + '.wav'

## My-Voice-Analysis Library

https://github.com/Shahabks/my-voice-analysis

From [paper](https://www.jmir.org/2021/4/e24191/) analyzing stress of health care professionals:

For the temporal features, the My-Voice Analysis [28] package was used. This package was built off of the speech analysis research tool praat [29]. Temporal features were actualized as the speech rate, syllable count, rate of articulation, speaking duration, total duration, and ratio of speaking to nonspeaking. This package was also used to extract prosodic features, namely the F0 values: mean, standard deviation, minimum, maximum, and upper and lower quartiles. The F0 value is the representation of what is known as the pitch.

Temporal characteristics include measures of the proportion of speech (eg, duration of pauses and duration of speech segments), speech segment connectivity, and overall speech rate.

Prosodic characteristics capture long-term variations in perceived stress and speech rhythm. Prosodic features also measure alterations in personal speech style (eg, perceived pitch and speech intonation).

At first, every function I ran from this package returned a "Try again the sound of the audio was not clear" response. I ended up copying the code from the package's repo and modifying all the `sourcerun` file paths in the functions, and now it works.

In [2]:
# Import the local copy I edited slightly.
# The module name contains hyphens, so a plain `import` statement will not work.
mysp = __import__('my-voice-analysis')

In [3]:
# wav_path[:-1] drops the trailing slash from the directory path
summary_dataset = mysp.mysptotal(file_name, wav_path[:-1])

In [4]:
summary_dataset.T

Unnamed: 0,0
number_ of_syllables,645.0
number_of_pauses,82.0
rate_of_speech,4.0
articulation_rate,5.0
speaking_duration,118.9
original_duration,183.0
balance,0.6
f0_mean,216.9
f0_std,54.33
f0_median,214.0
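
The rounded values in this table are consistent with simple ratios of the syllable count and the two durations. The sketch below assumes those definitions (syllables per second of total time, syllables per second of speaking time, and speaking time over total time). The package gets its numbers from a Praat script, so `temporal_ratios` is a hypothetical helper for sanity-checking, not the library's implementation.

```python
def temporal_ratios(n_syllables, speaking_duration, original_duration):
    """Recompute the temporal ratios from a count and two durations (assumed definitions)."""
    return {
        "rate_of_speech": n_syllables / original_duration,      # syllables per second overall
        "articulation_rate": n_syllables / speaking_duration,   # syllables per second while speaking
        "balance": speaking_duration / original_duration,       # speaking time / total time
    }


# Values taken from the summary table above
ratios = temporal_ratios(645, 118.9, 183.0)
print({name: round(value, 2) for name, value in ratios.items()})
```

Rounding these ratios the way the table does (whole numbers for the two rates, one decimal for the balance) gives 4, 5, and 0.6, which matches the package output for this file.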


In [5]:
# Gender recognition and mood of speech:
gender_mood = mysp.myspgend(file_name, wav_path[:-1])
gender_mood

('a female, mood of speech: Reading, p-value/sample size= :0.00', 5)

In [6]:
def extract_mood(gender_mood_string):
    '''
    Extract the mood of speech from the description string returned by
    mysp.myspgend (the first element of its tuple).

    For example, from the string:
    'a female, mood of speech: Reading, p-value/sample size= :0.00'
    this returns "Reading".
    '''
    # Find the index of the first colon and the next comma after it
    colon_index = gender_mood_string.find(':')
    comma_index = gender_mood_string.find(',', colon_index)

    # Slice out the text between them, skipping the colon and the space after it
    mood = gender_mood_string[colon_index + 2:comma_index]

    return mood
    

In [7]:
mood = extract_mood(gender_mood[0])
mood

'Reading'
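
The slicing approach relies on the mood label following the first colon in the string. As a possible alternative, the sketch below anchors on the literal text `mood of speech:` with a regular expression and returns `None` when the label is missing. `extract_mood_regex` is a hypothetical name, and the only format it assumes is the one shown in the output above.

```python
import re


def extract_mood_regex(gender_mood_string):
    """Return the text after 'mood of speech:' up to the next comma, or None if absent."""
    match = re.search(r"mood of speech:\s*([^,]+)", gender_mood_string)
    return match.group(1).strip() if match else None


sample = 'a female, mood of speech: Reading, p-value/sample size= :0.00'
print(extract_mood_regex(sample))  # Reading
```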

## Python Speech Features Library

https://github.com/jameslyons/python_speech_features

[Mel Frequency Cepstral Coefficients (MFCC)](http://www.practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/)


From [paper](https://www.jmir.org/2021/4/e24191/) analyzing stress of health care professionals:

Formant features were calculated using the Python Speech Features library [30]. To characterize this aspect of speech, the original sound recording was refit according to a series of transformations commonly used for speech recognition that yield a better representation of the sound called the mel-frequency cepstrum (MFC). From this new representation of the sound form, the first 14 coefficients of the MFC were extracted. The MFC values were extracted given that they describe the spectral shape of the audio file, generally with diminishing returns in terms of how informative they are, which is why we only considered the first 14 coefficients. If we were to select a greater number of MFC values, it would result in a potentially needlessly more complex machine learning model using less informative features.

From each of these waves, the mean, variance, skewness, and kurtosis were calculated for the energy (static coefficient), velocity (first differential), and acceleration (second differential).

Formant characteristics represent the dominant components of the speech spectrum and convey information about the acoustic resonance of the vocal tract and its use. These markers are often indicative of articulatory coordination problems in motor speech control disorders.
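
The cells below only summarize the static coefficients. The velocity and acceleration mentioned in the excerpt are the first and second differentials of each coefficient over time. The pure-Python sketch below shows the arithmetic of a regression-based delta on a single toy coefficient track. `delta_track` is a hypothetical helper written for illustration. python_speech_features also ships a `delta(feat, N)` function that could be applied to `mfcc_feat` directly, although this notebook does not use it.

```python
def delta_track(track, n=2):
    """Regression-based delta of a 1-D sequence, repeating the edge values as padding."""
    denominator = 2 * sum(i * i for i in range(1, n + 1))
    padded = [track[0]] * n + list(track) + [track[-1]] * n
    return [
        sum(i * (padded[t + n + i] - padded[t + n - i]) for i in range(1, n + 1))
        / denominator
        for t in range(len(track))
    ]


coefficient = [0.0, 1.0, 2.0, 3.0, 4.0]        # toy values, not real MFCCs
velocity = delta_track(coefficient, n=1)       # first differential
acceleration = delta_track(velocity, n=1)      # second differential
print(velocity)       # [0.5, 1.0, 1.0, 1.0, 0.5]
print(acceleration)   # [0.25, 0.25, 0.0, -0.25, -0.25]
```

The same mean, variance, skewness, and kurtosis summaries could then be taken over the velocity and acceleration tracks as well.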

In [8]:
# pip install python_speech_features

In [9]:
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
from scipy.stats import kurtosis, skew
import pandas as pd
import numpy as np

In [10]:
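rate, sig = wav.read(wav_file)
num_mfccs = 13  # the library default; the paper quoted above used 14
# nfft is set explicitly, presumably so the FFT is at least as long as one analysis frame
mfcc_feat = mfcc(sig, rate, nfft=2000, numcep=num_mfccs)
# fbank_feat = logfbank(sig, rate, nfft=2000)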

In [11]:
mfcc_feat

array([[  8.21398726,  -0.94272344,   1.61217927, ...,  -5.22272978,
         10.41307205,   7.18030457],
       [  9.49550784,  10.74725207, -10.09674287, ...,  -7.3159577 ,
         17.53966079,   9.70359799],
       [ 10.56571434,  11.60457192, -18.64122097, ..., -12.30325592,
         15.31123069,  12.63877943],
       ...,
       [ 15.35200665,  25.40331055,  16.44454652, ..., -16.59204303,
        -18.24428657, -19.38499411],
       [ 15.14867481,  27.10373493,  18.54972296, ..., -15.62857001,
        -15.50449545, -13.23567051],
       [ 15.13878566,  27.68397097,  19.47926471, ..., -20.72077487,
        -18.03577014, -13.06890145]])

In [12]:
mfcc_feat.shape

(18301, 13)

In [13]:
df_mean = pd.DataFrame(mfcc_feat.mean(axis = 0)).T
df_mean.columns = [f'MFCC_{i+1}_Mean' for i in range(num_mfccs)]


In [14]:
df_var = pd.DataFrame(mfcc_feat.var(axis = 0)).T
df_var.columns = [f'MFCC_{i+1}_Var' for i in range(num_mfccs)]


In [15]:
df_skew = pd.DataFrame(skew(mfcc_feat)).T
df_skew.columns = [f'MFCC_{i+1}_Skew' for i in range(num_mfccs)]


In [16]:
df_kurtosis = pd.DataFrame(kurtosis(mfcc_feat)).T
df_kurtosis.columns = [f'MFCC_{i+1}_Kurtosis' for i in range(num_mfccs)]


In [17]:
df_mfcc = pd.concat([df_mean, df_var, df_skew, df_kurtosis], axis=1)
df_mfcc.T

Unnamed: 0,0
MFCC_1_Mean,13.523993
MFCC_2_Mean,17.242075
MFCC_3_Mean,-5.181864
MFCC_4_Mean,-4.193514
MFCC_5_Mean,4.482192
MFCC_6_Mean,-7.730148
MFCC_7_Mean,-3.215254
MFCC_8_Mean,-7.670091
MFCC_9_Mean,-13.003896
MFCC_10_Mean,-3.161761
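
For reference, the skewness and kurtosis columns come from `scipy.stats.skew` and `scipy.stats.kurtosis` with their default settings, which use biased central moments and report excess (Fisher) kurtosis. The sketch below spells out that arithmetic in plain Python for a single column. The function names are hypothetical and the input is toy data.

```python
def central_moment(values, order):
    """Biased central moment: mean of (x - mean) ** order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by the variance to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (a normal distribution scores 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


toy_column = [1.0, 2.0, 3.0, 4.0, 5.0]
print(skewness(toy_column), excess_kurtosis(toy_column))
```

A symmetric column like this one has zero skewness, and its negative excess kurtosis reflects tails that are lighter than a normal distribution's.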


## Librosa Library

https://pypi.org/project/librosa/

From [paper](https://www.jmir.org/2021/4/e24191/) analyzing stress of health care professionals:

The Librosa package [31] was used to calculate the mean, maximum, minimum, and standard deviation of the root mean square value, centroid, bandwidth, flatness, zero-crossing rate, loudness, and flux of the spectrogram, or the visualization of the recording.

In [18]:
# pip install librosa


In [19]:
import librosa

In [21]:
# Load the WAV file using librosa (by default this resamples to 22,050 Hz and converts to mono)
y, sr = librosa.load(wav_file)


In [58]:
rms = librosa.feature.rms(y=y)


In [59]:
rms

array([[0.00092698, 0.00125816, 0.00255589, ..., 0.05829348, 0.0525038 ,
        0.04251017]], dtype=float32)

In [60]:
rms.shape

(1, 7883)
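
The `(1, 7883)` shape means one row holding one RMS value per analysis frame. The sketch below shows the idea in plain Python, using librosa's default `frame_length=2048` and `hop_length=512` as parameter defaults. It is a simplified illustration rather than librosa's implementation: `frame_rms` skips the centered padding librosa applies by default, and `centered_frame_count` is a hypothetical helper giving the frame count I would expect with that padding.

```python
import math


def frame_rms(samples, frame_length=2048, hop_length=512):
    """RMS of each full frame; no padding, unlike librosa's centered frames."""
    starts = range(0, len(samples) - frame_length + 1, hop_length)
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_length]) / frame_length)
        for i in starts
    ]


def centered_frame_count(n_samples, hop_length=512):
    """Frame count when the signal is padded by half a frame on each side."""
    return 1 + n_samples // hop_length


toy_signal = [3.0, 4.0, 3.0, 4.0]
print(frame_rms(toy_signal, frame_length=2, hop_length=2))
print(centered_frame_count(1024))
```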

In [31]:
def calc_mean_max_min_stdev(array):
    return [np.mean(array), np.max(array), np.min(array), np.std(array)]

In [38]:
# Root mean square value
rms = pd.DataFrame(calc_mean_max_min_stdev(librosa.feature.rms(y=y))).T
rms.columns = [f'RMS_{i}' for i in ['Mean', 'Max', 'Min', 'Std']]


In [39]:
centroid = pd.DataFrame(calc_mean_max_min_stdev(librosa.feature.spectral_centroid(y=y))).T
centroid.columns = [f'Centroid_{i}' for i in ['Mean', 'Max', 'Min', 'Std']]


In [40]:
bandwidth = pd.DataFrame(calc_mean_max_min_stdev(librosa.feature.spectral_bandwidth(y=y))).T
bandwidth.columns = [f'Bandwidth_{i}' for i in ['Mean', 'Max', 'Min', 'Std']]


In [41]:
flatness = pd.DataFrame(calc_mean_max_min_stdev(librosa.feature.spectral_flatness(y=y))).T
flatness.columns = [f'Flatness_{i}' for i in ['Mean', 'Max', 'Min', 'Std']]


In [47]:
zero_crossing_rate = pd.DataFrame(calc_mean_max_min_stdev(librosa.feature.zero_crossing_rate(y=y))).T
zero_crossing_rate.columns = [f'Zero_Crossing_Rate_{i}' for i in ['Mean', 'Max', 'Min', 'Std']]


In [48]:
# Loudness is approximated here as the frame-wise RMS converted to decibels
loudness = pd.DataFrame(calc_mean_max_min_stdev(librosa.amplitude_to_db(librosa.feature.rms(y=y)))).T
loudness.columns = [f'Loudness_{i}' for i in ['Mean', 'Max', 'Min', 'Std']]
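
The decibel conversion behind the loudness feature is essentially `20 * log10(amplitude / ref)` with a small floor to avoid taking the log of zero. The sketch below shows that core formula in plain Python. It is a simplification: `amplitude_to_db_simple` is a hypothetical name, the `ref` and `amin` defaults mirror librosa's, and the `top_db` clipping that `librosa.amplitude_to_db` applies by default is left out.

```python
import math


def amplitude_to_db_simple(amplitude, ref=1.0, amin=1e-5):
    """Convert an amplitude to decibels relative to ref, flooring tiny values at amin."""
    return 20.0 * math.log10(max(amin, amplitude) / ref)


print(amplitude_to_db_simple(0.1))   # about -20 dB
print(amplitude_to_db_simple(0.0))   # floored at about -100 dB
```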


In [49]:
df_librosa = pd.concat([rms, centroid, bandwidth, 
                        flatness, zero_crossing_rate, loudness], axis=1)
df_librosa.T

Unnamed: 0,0
RMS_Mean,0.04572282
RMS_Max,0.3379434
RMS_Min,0.0001835118
RMS_Std,0.04680617
Centroid_Mean,1166.937
Centroid_Max,5838.62
Centroid_Min,271.6106
Centroid_Std,629.3843
Bandwidth_Mean,1288.237
Bandwidth_Max,3084.776
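
Because the statistic names and the values are built in separate places in the cells above, their order has to be kept in sync by hand. One possible way to avoid that is to return each value together with its label. The helper below is a hypothetical sketch using only the standard library; a dictionary like this can be passed to `pd.DataFrame([...])` to get one labeled row per feature.

```python
import statistics


def summarize(prefix, values):
    """Return labeled summary statistics so names and values cannot drift apart."""
    return {
        f"{prefix}_Mean": statistics.mean(values),
        f"{prefix}_Max": max(values),
        f"{prefix}_Min": min(values),
        f"{prefix}_Std": statistics.pstdev(values),  # population std, like np.std
    }


toy_rms = [1.0, 2.0, 3.0]   # toy values standing in for a flattened feature array
summary = summarize("RMS", toy_rms)
print(summary)
```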
