In [1]:
%matplotlib inline

Note: if plots don't show, there is a recent matplotlib issue that may be related:
https://github.com/matplotlib/matplotlib/issues/18396


# Augment Speech and Sound for Machine Learning


Use soundpy to augment audio signals. 

To see how soundpy implements this, see `soundpy.augment`.

Note: soundpy is an experimental framework. It is a package mainly for exploring sound in the context of machine learning; testing of its functionality is ongoing.



In [2]:
# to be able to import soundpy from parent directory:
import os
package_dir = '../'
os.chdir(package_dir)

In [3]:
import soundpy as sp
import IPython.display as ipd

## Let's work with speech and sound (car horn)

How one augments audio data (including its visual representations, such as spectrograms) **depends on the type of audio.** 

**Speech data tends to only be useful if the phonological rules of a language are respected.** For example, splitting a speech signal into several sections might result in unrealistic representations of speech. (However, it may work for a specific purpose, e.g. gender/age classification, if the order of speech segments doesn't matter.)

Sound data in general can only be varied in limited ways before being fed to neural networks. In computer vision, a car might appear in a picture upside down (car accident) or on its side (car trick). Sound, however, is recorded via microphone and digitized, so it can only really be visualized accordingly; we don't record things upside down. **Therefore, one doesn't tend to rotate sound 'images' (the signal and STFT graphs below) the same way one might rotate images for object recognition.** 

Finally, these augmenting functions are a selection from those applied in research (the specific papers can be found in the source code). There are other ways of augmenting sound data. 

In [4]:
# Use function 'string2pathlib' to turn string path into pathlib object 
# This allows flexibility across operating systems
speech_path = sp.string2pathlib('audiodata/python.wav')
honk_path = sp.string2pathlib('audiodata/car_horn.wav')
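
As a rough illustration of why a path object helps across operating systems, the sketch below uses the standard library's `pathlib` directly. It assumes that `sp.string2pathlib` returns a `pathlib`-style path object; the variable names are illustrative only.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same relative path, interpreted for two operating systems.
posix_path = PurePosixPath('audiodata/python.wav')
windows_path = PureWindowsPath('audiodata/python.wav')

# Path objects expose the parts of a path without manual string splitting.
print(posix_path.name)    # file name with extension
print(posix_path.suffix)  # file extension
print(posix_path.parent)  # containing directory

# Each flavour renders the separator its operating system expects.
print(str(posix_path))
print(str(windows_path))
```

In everyday code `pathlib.Path` picks the right flavour for the current operating system automatically.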

# Hear and see speech 

(later we'll examine the non-speech sound)

### Note: you can visualize the sound with `feature_type` set as:

* 'stft' (default for this notebook)
* 'signal' 
* 'fbank' 
* 'mfcc'
* 'powspec'
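
To give a feel for what an 'stft' feature matrix contains, here is a minimal NumPy sketch of a short-time Fourier transform. It is not soundpy's implementation; the function `simple_stft`, its defaults, and the test tone are assumptions made for illustration.

```python
import numpy as np

def simple_stft(samples, sr, win_size_ms=20, percent_overlap=0.5):
    """Return a (num_frames, num_freq_bins) complex STFT matrix."""
    frame_length = int(sr * win_size_ms / 1000)
    hop_length = int(frame_length * (1 - percent_overlap))
    window = np.hanning(frame_length)
    num_frames = 1 + (len(samples) - frame_length) // hop_length
    # Slice the signal into overlapping frames and taper each with the window.
    frames = np.stack([
        samples[i * hop_length: i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    # One real-input FFT per frame.
    return np.fft.rfft(frames, axis=1)

# One second of a 1000 Hz tone as a stand-in for real audio.
sr_demo = 16000
t = np.arange(sr_demo) / sr_demo
tone = np.sin(2 * np.pi * 1000 * t)

stft_demo = simple_stft(tone, sr_demo)
power_spectrum = np.abs(stft_demo) ** 2  # squared magnitude per frame and bin
freqs = np.fft.rfftfreq(int(sr_demo * 20 / 1000), d=1 / sr_demo)
peak_hz = freqs[power_spectrum.mean(axis=0).argmax()]
print(stft_demo.shape, peak_hz)
```

Each row is one short window of the signal and each column one frequency bin, which is what the STFT plots in this notebook display as time against frequency.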

In [5]:
feature_type = 'stft'
sr = 48000
f, sr = sp.loadsound(speech_path, sr=sr)
ipd.Audio(f,rate=sr)

In [6]:
sp.plotsound(f, sr=sr, feature_type=feature_type, title='Female Speech ({})'.format(feature_type.upper()))

# Augmentation appropriate for speech signals 


## Change Speed



Let's increase the speed by 15% (default setting).


In [7]:
perc = 0.15
fast = sp.augment.speed_increase(f, sr=sr, perc = perc)

In [8]:
ipd.Audio(fast,rate=sr)

In [9]:
sp.plotsound(fast, sr=sr, feature_type=feature_type, 
               title='Female speech {} \n{}%  faster'.format(feature_type.upper(), int(perc*100)))
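
For intuition about what a 15% speed change does to the sample count, the sketch below resamples a signal by linear interpolation. This is a simplified stand-in, not necessarily how `sp.augment.speed_increase` works: plain resampling also shifts the pitch, and the helper `change_speed` is hypothetical.

```python
import numpy as np

def change_speed(samples, factor):
    """Resample by linear interpolation so playback is `factor` times faster."""
    num_out = int(round(len(samples) / factor))
    old_positions = np.arange(len(samples))
    # Spread the new sample positions evenly over the original time span.
    new_positions = np.linspace(0, len(samples) - 1, num_out)
    return np.interp(new_positions, old_positions, samples)

# One second of a 220 Hz tone as a stand-in for real audio.
sr_demo = 8000
tone = np.sin(2 * np.pi * 220 * np.arange(sr_demo) / sr_demo)

perc_demo = 0.15
faster = change_speed(tone, 1 + perc_demo)  # fewer samples, shorter sound
slower = change_speed(tone, 1 - perc_demo)  # more samples, longer sound
print(len(tone), len(faster), len(slower))
```

Played back at the same sampling rate, the shorter array sounds faster and the longer one slower.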

Let's decrease the speed by 15%:



In [10]:
slow = sp.augment.speed_decrease(f, sr=sr, perc = perc)

In [11]:
ipd.Audio(slow,rate=sr)

In [12]:
sp.plotsound(slow, sr=sr, feature_type=feature_type, 
               title='Speech ({}) \n{}%  slower'.format(feature_type.upper(), int(perc*100)))

In [13]:
slow_stft = sp.feats.get_feats(slow, sr=sr, feature_type=feature_type)

In [14]:
sp.feats.plot(slow_stft, feature_type='stft')

## Add Noise




Add white noise at a signal-to-noise ratio (SNR) of 10:



In [15]:
snr = 10
noisy = sp.augment.add_white_noise(f, sr=sr, snr = snr)




In [16]:
ipd.Audio(noisy,rate=sr)

In [17]:
sp.plotsound(noisy, sr=sr, feature_type=feature_type, 
               title='Speech with white noise ({}) \n{} SNR'.format(feature_type.upper(), snr))
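
The idea of mixing noise at a target SNR can be sketched with NumPy as follows. The sketch assumes the SNR is expressed in decibels and uses Gaussian white noise; the helper `add_white_noise_at_snr` is hypothetical and is not soundpy's code.

```python
import numpy as np

def add_white_noise_at_snr(samples, snr_db, seed=None):
    """Return (noisy_samples, noise) with Gaussian noise scaled to `snr_db`."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(samples ** 2)
    # SNR in dB is 10 * log10(signal_power / noise_power); solve for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=len(samples))
    return samples + noise, noise

# One second of a 440 Hz tone as a stand-in for real audio.
sr_demo = 8000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr_demo) / sr_demo)

noisy_tone, noise = add_white_noise_at_snr(tone, snr_db=10, seed=0)
measured_snr_db = 10 * np.log10(np.mean(tone ** 2) / np.mean(noise ** 2))
print(round(measured_snr_db, 2))
```

A lower SNR value means relatively louder noise; the measured value only approximates the target because the noise is random.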

## Harmonic Distortion

The sine function is applied to the signal 5 times.

In [18]:
hd = sp.augment.harmonic_distortion(f, sr=sr)

In [19]:
ipd.Audio(hd,rate=sr)

In [20]:
sp.plotsound(hd, sr=sr, feature_type=feature_type, 
               title='Speech with harmonic distortion ({})'.format(feature_type.upper()))
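
Applying the sine function repeatedly can be sketched in a few lines of NumPy. The pre-scaling by `2 * pi` and the helper name `repeated_sine_distortion` are assumptions for illustration; check the source of `sp.augment.harmonic_distortion` for the exact steps.

```python
import numpy as np

def repeated_sine_distortion(samples, num_passes=5, pre_gain=2 * np.pi):
    """Apply np.sin to the (pre-scaled) signal `num_passes` times."""
    distorted = pre_gain * samples  # assumed scaling, not confirmed by the source
    for _ in range(num_passes):
        # Each pass is a nonlinear mapping that adds harmonics.
        distorted = np.sin(distorted)
    return distorted

# One second of a 440 Hz tone as a stand-in for real audio.
sr_demo = 8000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr_demo) / sr_demo)

distorted_tone = repeated_sine_distortion(tone)
print(np.abs(distorted_tone).max())
```

Because the sine function is nonlinear, a pure tone comes out with additional harmonics, which is why the distorted STFT plot shows extra energy that the original lacks.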

## Pitch Shift


### Pitch shift increase



In [21]:
num_semitones = 2
psi = sp.augment.pitch_increase(f, sr=sr, num_semitones = num_semitones)

In [22]:
ipd.Audio(psi,rate=sr)

In [23]:
sp.plotsound(psi, sr=sr, feature_type=feature_type, 
               title='Speech ({})\nwith pitch shift increase of ({}) semitones'.format(feature_type.upper(), num_semitones))

### Pitch shift decrease



In [24]:
psd = sp.augment.pitch_decrease(f, sr=sr, num_semitones = num_semitones)

In [25]:
ipd.Audio(psd,rate=sr)

In [26]:
sp.plotsound(psd, sr=sr, feature_type=feature_type, 
               title='Speech ({})\nwith pitch shift decrease of ({}) semitones'.format(feature_type.upper(), num_semitones))
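
To relate semitones to frequency: in twelve-tone equal temperament, each semitone multiplies the frequency by the twelfth root of 2. The short sketch below works through that arithmetic; the helper `semitone_ratio` is hypothetical and says nothing about how soundpy performs the shift internally.

```python
def semitone_ratio(num_semitones):
    """Frequency ratio for a shift of `num_semitones` in equal temperament."""
    return 2 ** (num_semitones / 12)

up = semitone_ratio(2)     # about 1.122
down = semitone_ratio(-2)  # about 0.891

# Effect on a 440 Hz reference tone.
a4 = 440.0
print(a4 * up, a4 * down)
```

A shift of 12 semitones gives a ratio of exactly 2, i.e. one octave.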

## Vocal Tract Length Perturbation (VTLP) by factor 0.8 to 1.2

### (Very experimental)

This function returns an STFT matrix (not sample data) as well as the warping factor.
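
To illustrate what a warping factor does, the sketch below implements the piecewise-linear frequency warping commonly used for VTLP (often attributed to Jaitly and Hinton, 2013). Whether `sp.augment.vtlp` uses exactly this formula is an assumption, as are the helper name `vtlp_warp` and the boundary frequency `f_hi=4800`.

```python
import numpy as np

def vtlp_warp(freqs_hz, alpha, sr, f_hi=4800.0):
    """Map frequencies to warped frequencies for warping factor `alpha`."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    nyquist = sr / 2
    # Below the boundary, frequencies are simply scaled by alpha.
    boundary = f_hi * min(alpha, 1.0) / alpha
    lower = freqs_hz * alpha
    # Above it, a second line segment ends exactly at the Nyquist frequency.
    slope = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    upper = nyquist - slope * (nyquist - freqs_hz)
    return np.where(freqs_hz <= boundary, lower, upper)

sr_demo = 48000
freqs = np.linspace(0, sr_demo / 2, 1201)
warped_low = vtlp_warp(freqs, 0.8, sr_demo)   # compresses low frequencies
warped_high = vtlp_warp(freqs, 1.2, sr_demo)  # stretches low frequencies
print(vtlp_warp(1000.0, 0.8, sr_demo), vtlp_warp(1000.0, 1.2, sr_demo))
```

The two segments keep the mapping monotonic and pin both 0 Hz and the Nyquist frequency in place, so only the frequencies in between are moved.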


### Vocal tract length perturbation: factor 0.8



In [27]:
vtlp_stft, a = sp.augment.vtlp(f, sr=sr, win_size_ms = 50,
                                 percent_overlap = 0.5,
                                 random_seed = 41)

In order to listen to this, we need to turn the STFT into samples:



In [28]:
vtlp_y = sp.feats.feats2audio(vtlp_stft, sr = sr,
                                feature_type = 'stft',
                                win_size_ms = 50,
                                percent_overlap = 0.5)
ipd.Audio(vtlp_y,rate=sr)

In [29]:
# this function plots the features directly whereas plotsound extracts the features first.
sp.feats.plot(vtlp_stft, sr=sr, feature_type='stft', 
               title='VTLP (STFT)\nfactor {}'.format(a))

### Vocal tract length perturbation: factor 1.2



In [30]:
vtlp_stft, a = sp.augment.vtlp(f, sr=sr, win_size_ms = 50,
                                 percent_overlap = 0.5,
                                 random_seed = 43)

In order to listen to this, we need to turn the STFT into samples:



In [31]:
vtlp_y = sp.feats.feats2audio(vtlp_stft, sr = sr,
                                feature_type = 'stft',
                                win_size_ms = 50,
                                percent_overlap = 0.5)
ipd.Audio(vtlp_y,rate=sr)

In [32]:
sp.feats.plot(vtlp_stft, sr=sr, feature_type='stft', 
               title='VTLP (STFT)\nfactor {}'.format(a))

# Augmentation appropriate for non-speech signals 



## Hear and see sound signal 



In [33]:
h, sr = sp.loadsound(honk_path, sr=sr)
ipd.Audio(h,rate=sr)

In [34]:
sp.plotsound(h, sr=sr, feature_type=feature_type, 
               title='Car Horn Sound ({})'.format(feature_type.upper()))

## Time Shift



### We'll apply a random time shift to the sound

You can set a `random_seed`.


In [35]:
h_shift = sp.augment.time_shift(h, sr=sr)

In [36]:
ipd.Audio(h_shift,rate=sr)

In [37]:
sp.plotsound(h_shift, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\nTime Shifted'.format(feature_type.upper()))

## Shuffle the Sound

It's hard to tell the difference between the time shift and the shuffle with this sound. The difference is that `shufflesound` divides the sound into `num_subsections` sections and then shuffles them, while `time_shift` divides the sound into just 2 sections and swaps them.
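
The contrast between the two operations can be sketched with NumPy on a tiny array. The helpers `swap_two_sections` and `shuffle_sections` are hypothetical stand-ins that follow the description above; they are not soundpy's functions.

```python
import numpy as np

def swap_two_sections(samples, split_index):
    """Time shift: cut the signal once and swap the two parts."""
    return np.concatenate([samples[split_index:], samples[:split_index]])

def shuffle_sections(samples, num_subsections, seed=None):
    """Shuffle: cut the signal into several parts and reorder them randomly."""
    rng = np.random.default_rng(seed)
    sections = np.array_split(samples, num_subsections)
    order = rng.permutation(num_subsections)
    return np.concatenate([sections[i] for i in order])

demo = np.arange(10)
shifted = swap_two_sections(demo, 4)          # [4 5 6 7 8 9 0 1 2 3]
shuffled = shuffle_sections(demo, 5, seed=0)  # same samples, new section order
print(shifted, shuffled)
```

In both cases every sample is kept; only the order of whole sections changes.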


In [38]:
h_shuffle = sp.augment.shufflesound(h, sr=sr,
                                      num_subsections = 5)

In [39]:
ipd.Audio(h_shuffle,rate=sr)

In [40]:
sp.plotsound(h_shuffle, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\nShuffled'.format(feature_type.upper()))

### Just for comparison: apply this to speech!

Let's have a listen and look at what happens when applied to speech:

In [41]:
f_shuffle = sp.augment.shufflesound(f, sr=sr,
                                      num_subsections = 4)

In [42]:
ipd.Audio(f_shuffle,rate=sr)

In [43]:
sp.plotsound(f_shuffle, sr=sr, feature_type=feature_type, 
               title='Speech ({})\nShuffled'.format(feature_type.upper()))

## Add Noise


### Add white noise at SNR 10

Feel free to play around with the SNR value.

In [44]:
snr = 10

In [45]:
h_noisy = sp.augment.add_white_noise(h, sr=sr, snr = snr)

In [46]:
ipd.Audio(h_noisy,rate=sr)

In [47]:
sp.plotsound(h_noisy, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\nwith white noise ({} SNR)'.format(feature_type.upper(), snr))

## Change Speed



In [48]:
perc = .15
h_fast = sp.augment.speed_increase(h, sr=sr, perc = perc) # default perc set to 0.15

In [49]:
ipd.Audio(h_fast,rate=sr)

In [50]:
sp.plotsound(h_fast, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\nSpeed increase {}%'.format(feature_type.upper(), int(perc*100)))

In [51]:
h_slow = sp.augment.speed_decrease(h, sr=sr, perc=perc) # default perc set to 0.15

In [52]:
ipd.Audio(h_slow,rate=sr)

In [53]:
sp.plotsound(h_slow, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\nSpeed decrease {}%'.format(feature_type.upper(), int(perc*100)))

## Harmonic Distortion

The sine function is applied to the signal 5 times.

In [54]:
h_hd = sp.augment.harmonic_distortion(h, sr=sr)

In [55]:
ipd.Audio(h_hd,rate=sr)

In [56]:
sp.plotsound(h_hd, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\nHarmonic Distortion'.format(feature_type.upper()))

## Pitch Shift


In [57]:
num_semitones = 2

### Pitch shift increase

In [58]:
h_psi = sp.augment.pitch_increase(h, sr=sr, num_semitones = num_semitones)

In [59]:
ipd.Audio(h_psi,rate=sr)

In [60]:
sp.plotsound(h_psi, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\npitch shift increase of {} semitones'.format(feature_type.upper(), num_semitones))

### Pitch shift decrease

In [61]:
h_psd = sp.augment.pitch_decrease(h, sr=sr, num_semitones = num_semitones)

In [62]:
ipd.Audio(h_psd,rate=sr)

In [63]:
sp.plotsound(h_psd, sr=sr, feature_type=feature_type, 
               title='Car horn ({})\npitch shift decrease of {} semitones'.format(feature_type.upper(), num_semitones))