# Speech Feature Engineering




## Convert Audio File Format

In [2]:
import os
from pydub import AudioSegment

pydub raised a "No such file or directory: 'ffprobe'" error, so I installed ffmpeg with `conda install -c conda-forge ffmpeg` (rather than pip).
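
A quick way to confirm that the install fixed the problem is to check whether the executables are visible on `PATH`. This is a minimal sketch using only the standard library; the helper name `missing_ffmpeg_tools` is hypothetical and not part of pydub.

```python
import shutil

def missing_ffmpeg_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_ffmpeg_tools()
if missing:
    print("Missing executables:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe found on PATH")
```

If `ffprobe` is reported as missing, pydub will likely fail again when loading an `.m4a` file.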

In [3]:
# File paths for audio files
m4a_path = './data/m4a_files/'
wav_path = './data/wav_files/'

In [4]:
def convert_m4a_to_wav(file_name, input_dir, output_dir):
    '''
    Convert an .m4a audio file to a 44.1 kHz, 16-bit .wav audio file using PyDub.

    Inputs:
        file_name (str): name of file (without extension)
        input_dir (str): directory path for input .m4a file
        output_dir (str): directory path for output .wav file
    '''

    # Load the m4a file
    audio = AudioSegment.from_file(os.path.join(input_dir, file_name + '.m4a'), format='m4a')

    # Set the desired sample rate and bit depth
    desired_sample_rate = 44100
    desired_sample_width = 2  # bytes per sample, i.e. 16-bit depth

    # Resample the audio to the desired sample rate
    resampled_audio = audio.set_frame_rate(desired_sample_rate)

    # Set the bit depth to the desired value
    converted_audio = resampled_audio.set_sample_width(desired_sample_width)

    # Export the converted audio to a new WAV file
    converted_audio.export(os.path.join(output_dir, file_name + '.wav'), format='wav')

In [5]:
# Create the directory for wav files (if it doesn't already exist)
os.makedirs(wav_path, exist_ok=True)


In [6]:
# List all files in the original directory
file_list = os.listdir(m4a_path)
# Keep only .m4a files, skip hidden entries that start with '.'
# (e.g. .ipynb_checkpoints), and strip the extension
file_list = [os.path.splitext(x)[0] for x in file_list
             if x.endswith('.m4a') and not x.startswith('.')]

In [7]:
# Temporary subset: only the first two files, for a quick test run
file_list_temp = file_list[0:2]

In [8]:
# Convert the audio files in the temporary subset to wav format
for file in file_list_temp:
    convert_m4a_to_wav(file, m4a_path, wav_path)
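
To check that a converted file really has the intended sample rate and bit depth, the WAV header can be read back with the standard-library `wave` module. The sketch below is an illustration, not part of the original notebook: the helper `wav_properties` is a hypothetical name, and it is demonstrated on a synthetic one-second silent file so that it runs without any recordings.

```python
import os
import tempfile
import wave

def wav_properties(path):
    """Read basic format information from a WAV file header."""
    with wave.open(path, "rb") as handle:
        return {
            "sample_rate": handle.getframerate(),
            "sample_width_bytes": handle.getsampwidth(),
            "channels": handle.getnchannels(),
            "duration_s": handle.getnframes() / handle.getframerate(),
        }

# Demo: write one second of 16-bit mono silence at 44.1 kHz, then inspect it
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)        # 2 bytes per sample = 16-bit
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 44100)

demo_props = wav_properties(demo_path)
print(demo_props)
```

Pointing `wav_properties` at a file in `wav_path` should report a sample rate of 44100 and a sample width of 2 bytes if the conversion behaved as intended.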


## Testing on a single file

In [9]:
file_name = "348th_11.4.21" # Audio file name (without extension)
wav_file = os.path.join(wav_path, file_name + '.wav')

## My-Voice-Analysis Library

https://github.com/Shahabks/my-voice-analysis

From the [paper](https://www.jmir.org/2021/4/e24191/) analyzing stress of health care professionals:

> For the temporal features, the My-Voice Analysis [28] package was used. This package was built off of the speech analysis research tool praat [29]. Temporal features were actualized as the speech rate, syllable count, rate of articulation, speaking duration, total duration, and ratio of speaking to nonspeaking. This package was also used to extract prosodic features, namely the F0 values: mean, standard deviation, minimum, maximum, and upper and lower quartiles. The F0 value is the representation of what is known as the pitch.

In [None]:
# pip install my-voice-analysis

At first, every function I ran from this package returned a "Try again the sound of the audio was not clear" response. I ended up copying the code from the package's repo and modifying all the `sourcerun` file paths in the functions, and now it works.

In [11]:
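# `mysp` is the locally modified copy of my-voice-analysis described above (its import is not shown)
# The second argument is wav_path without its trailing slash
summary_dataset = mysp.mysptotal(file_name, wav_path[:-1])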

In [12]:
summary_dataset.T

| | 0 |
|---|---|
| number_ of_syllables | 736.0 |
| number_of_pauses | 203.0 |
| rate_of_speech | 2.0 |
| articulation_rate | 5.0 |
| speaking_duration | 150.8 |
| original_duration | 309.3 |
| balance | 0.5 |
| f0_mean | 224.76 |
| f0_std | 62.28 |
| f0_median | 222.7 |


In [13]:
# Gender recognition and mood of speech:
gender_mood = mysp.myspgend(file_name, wav_path[:-1])
gender_mood

('a female, mood of speech: Reading, p-value/sample size= :0.00', 5)

In [14]:
def extract_mood(gender_mood_string):
    '''
    Extract the mood of speech from the message string returned by the
    my-voice-analysis package (the first element of the returned tuple).

    For example, from the string:
    'a female, mood of speech: Reading, p-value/sample size= :0.00'
    this returns "Reading".
    '''

    # Find the index of the first colon and the next comma after it
    colon_index = gender_mood_string.find(':')
    comma_index = gender_mood_string.find(',', colon_index)

    # Extract the text between the colon and comma (skipping ': ') using slicing
    mood = gender_mood_string[colon_index+2:comma_index]

    return mood

In [16]:
mood = extract_mood(gender_mood[0])
mood

'Reading'
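
The same message also carries the predicted gender, so a single regular expression can pull out both fields at once. This is a sketch that assumes the message always follows the format of the one example shown above; `parse_gender_mood` is a hypothetical helper, not part of my-voice-analysis.

```python
import re

def parse_gender_mood(message):
    """Extract gender and mood from a my-voice-analysis style message string.

    Returns None when the message does not match the assumed format.
    """
    match = re.search(r"a (\w+), mood of speech: ([^,]+),", message)
    if match is None:
        return None
    return {"gender": match.group(1), "mood": match.group(2).strip()}

example = "a female, mood of speech: Reading, p-value/sample size= :0.00"
parsed = parse_gender_mood(example)
print(parsed)
```

Returning `None` on a mismatch makes it easier to spot recordings where the package produced an unexpected message instead of silently slicing the wrong text.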

## Python Speech Features Library

https://github.com/jameslyons/python_speech_features

[Mel Frequency Cepstral Coefficients (MFCC)](http://www.practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/)


From the [paper](https://www.jmir.org/2021/4/e24191/) analyzing stress of health care professionals:

> Formant features were calculated using the Python Speech Features library [30]. To characterize this aspect of speech, the original sound recording was refit according to a series of transformations commonly used for speech recognition that yield a better representation of the sound called the mel-frequency cepstrum (MFC). From this new representation of the sound form, the first 14 coefficients of the MFC were extracted. The MFC values were extracted given that they describe the spectral shape of the audio file, generally with diminishing returns in terms of how informative they are, which is why we only considered the first 14 coefficients. If we were to select a greater number of MFC values, it would result in a potentially needlessly more complex machine learning model using less informative features.
>
> From each of these waves, the mean, variance, skewness, and kurtosis were calculated for the energy (static coefficient), velocity (first differential), and acceleration (second differential).


In [20]:
# pip install python_speech_features

In [24]:
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav


In [25]:
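# Read the sample rate and raw samples from the wav file
(rate, sig) = wav.read(wav_file)
# nfft=2000 covers the default 25 ms analysis window (about 1103 samples at 44.1 kHz)
mfcc_feat = mfcc(sig, rate, nfft=2000)
fbank_feat = logfbank(sig, rate, nfft=2000)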
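
The `nfft` value should be at least as large as the analysis window measured in samples, otherwise part of each frame is lost. FFT sizes are commonly chosen as powers of two, so the sketch below computes the smallest power of two that covers the window. It assumes the library's default 25 ms window length; `min_power_of_two_nfft` is a hypothetical helper written for illustration.

```python
def min_power_of_two_nfft(sample_rate, winlen=0.025):
    """Smallest power of two that is >= the window length in samples."""
    window_samples = winlen * sample_rate
    nfft = 1
    while nfft < window_samples:
        nfft *= 2
    return nfft

print(min_power_of_two_nfft(44100))  # 25 ms at 44.1 kHz is 1102.5 samples -> 2048
print(min_power_of_two_nfft(16000))  # 25 ms at 16 kHz is 400 samples -> 512
```

For the 44.1 kHz files produced earlier, this suggests `nfft=2048` as an alternative to the hand-picked 2000.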

In [31]:
mfcc_feat

array([[ 10.29968018,  13.72290818, -29.19809589, ..., -13.55647358,
         -4.20161053,  11.28364227],
       [ 10.47709217,  14.78715985, -29.81894629, ..., -18.21122394,
         -9.48733424,   4.28475425],
       [ 11.78522646,  25.73448529,  -7.67875987, ...,  -5.00370421,
         -1.36992302,   2.23719493],
       ...,
       [ 15.39286591,  24.08755919,  -0.10427238, ...,  -3.71280953,
        -27.26275844,  -9.63292934],
       [ 14.47122335,  19.26920955,   4.0923109 , ...,  -5.08731833,
        -22.63729858,  -7.90971875],
       [  3.04433085, -12.69258491,  -7.31047902, ..., -12.002807  ,
         -9.2831477 ,  -4.79648642]])

In [32]:
mfcc_feat.shape

(30928, 13)

In [33]:
# TODO: extract mean, stdev, skewness, kurtosis
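
One possible way to fill in this TODO is sketched below. It computes the mean, variance, skewness, and excess kurtosis of every coefficient for the static values, the velocity, and the acceleration. Several choices here are assumptions rather than the paper's method: velocity and acceleration are approximated with `np.gradient` (the library's own `delta` function could be used instead), the moments are population moments computed directly with NumPy, and the feature names are made up. The demo runs on random numbers shaped like `mfcc_feat`, not on real audio.

```python
import numpy as np

def moment_stats(x):
    """Per-column mean, variance, skewness, and excess kurtosis of a (frames, coeffs) array."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    centered = x - mean
    var = (centered ** 2).mean(axis=0)
    std = np.sqrt(var)
    safe_std = np.where(std > 0, std, 1.0)  # avoid division by zero for constant columns
    skew = ((centered / safe_std) ** 3).mean(axis=0)
    kurt = ((centered / safe_std) ** 4).mean(axis=0) - 3.0
    return {"mean": mean, "var": var, "skew": skew, "kurtosis": kurt}

def summarize_mfcc(mfcc_feat):
    """Flatten static, velocity, and acceleration statistics into one feature dict."""
    static = np.asarray(mfcc_feat, dtype=float)
    velocity = np.gradient(static, axis=0)        # first difference over frames
    acceleration = np.gradient(velocity, axis=0)  # second difference over frames
    features = {}
    for name, series in (("static", static),
                         ("velocity", velocity),
                         ("acceleration", acceleration)):
        for stat_name, values in moment_stats(series).items():
            for i, value in enumerate(values):
                features[f"mfcc{i}_{name}_{stat_name}"] = float(value)
    return features

# Demo on random data with 13 columns, matching the shape of mfcc_feat above
rng = np.random.default_rng(0)
demo_mfcc = rng.normal(size=(200, 13))
demo_features = summarize_mfcc(demo_mfcc)
print(len(demo_features))  # 13 coefficients x 3 series x 4 statistics
```

Calling `summarize_mfcc(mfcc_feat)` on the real array would give one flat row of features per recording, which is a convenient shape for a pandas DataFrame.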

## Librosa Library

https://pypi.org/project/librosa/

From the [paper](https://www.jmir.org/2021/4/e24191/) analyzing stress of health care professionals:

> The Librosa package [31] was used to calculate the mean, maximum, minimum, and standard deviation of the root mean square value, centroid, bandwidth, flatness, zero-crossing rate, loudness, and flux of the spectrogram, or the visualization of the recording.
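
The four summary statistics named in the quoted passage can be sketched as a small helper that reduces any per-frame feature to scalars. This is an illustration under assumptions: `summarize_feature` and the key names are hypothetical, and the demo arrays are made-up stand-ins with the `(1, n_frames)` shape that librosa's per-frame feature functions return.

```python
import numpy as np

def summarize_feature(name, values):
    """Return the mean, max, min, and standard deviation of a per-frame feature."""
    flat = np.asarray(values, dtype=float).ravel()  # (1, n_frames) -> (n_frames,)
    return {
        f"{name}_mean": float(flat.mean()),
        f"{name}_max": float(flat.max()),
        f"{name}_min": float(flat.min()),
        f"{name}_std": float(flat.std()),
    }

# Made-up stand-ins for per-frame features such as RMS and spectral centroid
demo_tracks = {
    "rms": np.array([[0.1, 0.2, 0.3, 0.4]]),
    "centroid": np.array([[900.0, 1100.0, 1000.0, 1000.0]]),
}

summary = {}
for track_name, track in demo_tracks.items():
    summary.update(summarize_feature(track_name, track))
print(summary)
```

The same helper could be applied to each librosa feature array in turn to build one row of summary features per recording.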

In [None]:
# pip install librosa


In [None]:
import librosa

In [17]:
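# Load the WAV file using librosa (by default it resamples to 22050 Hz and mixes down to mono)
y, sr = librosa.load(wav_file)

# Extract pitch, loudness, and spectral centroid
# piptrack returns a tuple of two arrays: (pitches, magnitudes)
pitch = librosa.piptrack(y=y, sr=sr)
# RMS energy per frame, converted to decibels
loudness = librosa.amplitude_to_db(librosa.feature.rms(y=y))
spec_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)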

# Print the features
print("Pitch:", pitch)
print("Loudness:", loudness)
print("Spectral centroid:", spec_centroid)

Pitch: (array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32))
Loudness: [[-48.007576 -39.336563 -32.142986 ... -22.90003  -24.209024 -26.71492 ]]
Spectral centroid: [[1854.40866307  945.04239462  713.05587732 ... 1150.47042849
  1063.63073103  891.03299265]]
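
The pitch output above is a tuple of two 2-D arrays, pitches and magnitudes, each with one row per frequency bin and one column per frame. Most entries are zero, which is why only zeros are visible in the printout. A common way to reduce this to one pitch value per frame is to keep the candidate with the largest magnitude in each frame. The sketch below shows that idea on small made-up arrays; `dominant_pitch_per_frame` is a hypothetical helper, not a librosa function.

```python
import numpy as np

def dominant_pitch_per_frame(pitches, magnitudes):
    """Pick, for every frame, the pitch candidate with the largest magnitude."""
    pitches = np.asarray(pitches)
    magnitudes = np.asarray(magnitudes)
    best_bin = magnitudes.argmax(axis=0)   # index of the strongest bin in each frame
    frames = np.arange(pitches.shape[1])
    return pitches[best_bin, frames]       # stays 0.0 where no pitch was detected

# Made-up 3-bin x 4-frame arrays standing in for the two piptrack outputs
demo_pitches = np.array([[0.0, 110.0,   0.0, 0.0],
                         [0.0, 220.0, 225.0, 0.0],
                         [0.0,   0.0, 440.0, 0.0]])
demo_magnitudes = np.array([[0.0, 0.2, 0.0, 0.0],
                            [0.0, 0.9, 0.3, 0.0],
                            [0.0, 0.0, 0.7, 0.0]])

track = dominant_pitch_per_frame(demo_pitches, demo_magnitudes)
voiced = track[track > 0]                  # drop frames without a detected pitch
print(track, voiced.mean())
```

With the notebook's variables this would be called as `dominant_pitch_per_frame(*pitch)`, after which summary statistics can be taken over the non-zero frames.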
