<a href="https://colab.research.google.com/github/satyajitghana/TSAI-DeepVision-EVA4.0-Phase-2/blob/master/13-AI4Sound/WavAugment_walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WavAugment walkthrough

In this Colab notebook, we will go through some of the basic functionality that WavAugment provides. We will
*  install the required packages,
*  show how to apply simple augmentations to a speech sequence,
*  show how to combine and randomize them,
*  discuss some useful considerations and limitations.

Our overall goal is to cover most of the things that we found useful for deep self-supervised learning.

## Prerequisites

In Google Colab we don't have sox pre-installed, so we have to install it first; next, we install torchaudio and WavAugment.

In [None]:
!apt-get install libsox-fmt-all libsox-dev sox > /dev/null
! python -m pip install torchaudio > /dev/null
! python -m pip install git+https://github.com/facebookresearch/WavAugment.git > /dev/null

  Running command git clone -q https://github.com/facebookresearch/WavAugment.git /tmp/pip-req-build-79qy7_3f


If you want to record your own voice with your microphone, you also need to install ffmpeg-python:

In [None]:
!pip install ffmpeg-python > /dev/null

We will also need an audio file with speech for our demonstrations. 

We have two options here: either we take a file from the [LibriSpeech-100 dataset](http://www.openslr.org/12) that WavAugment uses for its tests, or we record our own voice.

In [None]:
import torchaudio
# Download example from WavAugment
! wget https://raw.githubusercontent.com/facebookresearch/WavAugment/master/tests/test.wav > /dev/null

# and load it as a tensor
x, sr = torchaudio.load('test.wav')

--2020-07-27 13:10:00--  https://raw.githubusercontent.com/facebookresearch/WavAugment/master/tests/test.wav
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 454764 (444K) [audio/wav]
Saving to: ‘test.wav’


2020-07-27 13:10:01 (5.44 MB/s) - ‘test.wav’ saved [454764/454764]



Alternatively, to record from the microphone, we need to define a utility function that grabs the microphone input:

In [None]:
# code taken from https://ricardodeazambuja.com/deep_learning/2019/03/09/audio_and_video_google_colab/
from IPython.display import HTML, Audio, display
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
import io
import ffmpeg
import tempfile
import pathlib


AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };            
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {            
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data); 
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording... pls wait!"
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});
      
</script>
"""

def get_audio():
  display(HTML(AUDIO_HTML))
  data = eval_js("data")
  binary = b64decode(data.split(',')[1])
  
  process = (ffmpeg
    .input('pipe:0')
    .output('pipe:1', format='wav')
    .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
  )
  output, err = process.communicate(input=binary)
  
  riff_chunk_size = len(output) - 8
  # Break up the chunk size into four bytes, held in b.
  q = riff_chunk_size
  b = []
  for i in range(4):
      q, r = divmod(q, 256)
      b.append(r)

  # Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.
  riff = output[:4] + bytes(b) + output[8:]

  with tempfile.TemporaryDirectory() as tmpdirname:
    path = pathlib.Path(tmpdirname) / 'tmp.wav'
    with open(path, 'wb') as f:
       f.write(riff)
       
    x, sr = torchaudio.load(path)

  return x, sr
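
The loop in `get_audio` that splits `riff_chunk_size` into four bytes produces the size in little-endian order, which is what the RIFF header stores in bytes 4 to 8 (the total length minus the 8-byte header prefix). As a side note, here is a minimal sketch of the same computation with the standard `struct` module. The helper names `riff_size_bytes` and `riff_size_bytes_manual` are ours and serve only as an illustration:

```python
import struct


def riff_size_bytes(n: int) -> bytes:
    """Encode n as a 4-byte little-endian unsigned integer."""
    return struct.pack('<I', n)


def riff_size_bytes_manual(n: int) -> bytes:
    """Same encoding via repeated divmod, mirroring the loop in get_audio."""
    b = []
    q = n
    for _ in range(4):
        q, r = divmod(q, 256)
        b.append(r)
    return bytes(b)


# Example: a WAV stream of 454764 bytes has a RIFF chunk size of 454764 - 8.
size = 454764 - 8
assert riff_size_bytes(size) == riff_size_bytes_manual(size)
```

Either form can be used to patch the header; the `struct` version just makes the byte order explicit.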

Now record a piece of speech, if you prefer:

In [None]:
x, sr = get_audio()

Now we are all set!

## Applying simple and useful augmentations

Let's import everything we will need.

In [None]:
import torch
import augment
import numpy as np

import IPython.display as ipd

Let's inspect the speech snippet we loaded and listen to it:

In [None]:
print(f'We loaded a speech example; sample rate: {sr}, number of channels: {x.size(0)}, its length is {x.size(1)} frames or about {x.size(1) // sr} seconds.')
ipd.Audio(x, rate=sr)

We loaded a speech example; sample rate: 48000, number of channels: 1, its length is 77760 frames or about 1 seconds.


Similarly to `sox`, the central entity of WavAugment is a sequence of effects, `augment.EffectChain`. As the name indicates, we can create various combinations of audio effects by chaining them together. This chain can be empty and do nothing:


In [None]:
empty_chain = augment.EffectChain()
y = empty_chain.apply(x, src_info={'rate': sr})

or can contain one or more effects. Let us create a chain that applies a clipping effect:

In [None]:
clip_chain = augment.EffectChain().clip(0.25)

In [None]:
y = clip_chain.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

We can append effects one after another, as in the example below, where we put the `rate` effect after the `pitch` one:

In [None]:
y = augment.EffectChain().pitch(-200).rate(sr) \
  .apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

Here, we have lowered the pitch of the voice by two semitones (one whole tone): -200 means a shift down by 200 cents, where a cent is 1/100 of a semitone.
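
To get a feeling for these numbers: a cent is 1/100 of an equal-tempered semitone, so 1200 cents make up one octave, that is, a doubling of frequency. The small helper below (`cents_to_ratio` is our own name, not part of WavAugment or sox) converts a shift in cents into the corresponding frequency ratio:

```python
def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a pitch shift in cents (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)


for cents in (-200, 200, 1200):
    print(f"{cents:+5d} cents -> frequency x {cents_to_ratio(cents):.4f}")
```

A shift of -200 cents therefore scales all frequencies by roughly 0.89, and +200 cents by roughly 1.12.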

Similarly, we can go up by the same amount:

In [None]:
y = augment.EffectChain().pitch(200).rate(sr) \
  .apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

Why do we need to put `rate` after `pitch`? At the moment, WavAugment's `pitch` is a somewhat thin wrapper around the corresponding `libsox` effect [*]. Internally, `libsox` represents a change in pitch as a combination of the `tempo` and `rate` effects, so for the time being we need to change the rate back manually.

[*] This is subject to change in the future, as we iterate on the library.


Another effect that we found useful is `reverb`. The reverberation provided by `sox` is specified by three parameters: reverberance, damping factor, and room size. Check how it sounds:

In [None]:
y = augment.EffectChain().reverb(50, 50, 50).channels(1).apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

Again, we need to add the `channels` effect due to peculiarities of `libsox`.

What else can we do? Another effect that is often used in the literature is replacing a small span of audio with silence. We can do that, too:

In [None]:
y = augment.EffectChain().time_dropout(max_seconds=0.5).apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)
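
To make the idea concrete, here is a minimal pure-Python sketch of such a dropout on a list of samples. It is an illustration only and not WavAugment's implementation; in particular, the way the span length and position are sampled here is our assumption, and the function name `time_dropout_sketch` is hypothetical:

```python
import random


def time_dropout_sketch(samples, sample_rate, max_seconds, rng=random):
    """Return a copy of samples with one random span replaced by zeros.

    The span length (in samples) is drawn uniformly from
    [0, max_seconds * sample_rate], capped at the signal length; the start
    is drawn uniformly among the positions where the span fits.
    """
    n = len(samples)
    max_len = min(n, int(max_seconds * sample_rate))
    length = rng.randint(0, max_len)     # randint is inclusive on both ends
    start = rng.randint(0, n - length)
    out = list(samples)
    out[start:start + length] = [0.0] * length
    return out, (start, length)


random.seed(0)
signal = [1.0] * 16000  # one second of a constant signal at 16 kHz
dropped, (start, length) = time_dropout_sketch(signal, 16000, max_seconds=0.5)
print(f"zeroed {length} samples starting at sample {start}")
```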

Applying additive noise is a bit more involved, as we need a database of noise, such as [MUSAN](https://www.openslr.org/17/). For the sake of this small tutorial, we will use generated uniform noise. The additive noise effect consumes a Callable that returns the noise to be added: 


In [None]:
noise_generator = lambda: torch.zeros_like(x).uniform_()

In [None]:
y = augment.EffectChain().additive_noise(noise_generator, snr=15).apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

Keep in mind that WavAugment normalizes neither the noise nor the input tensor.
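
If you want to control the relative levels yourself, one option is to rescale the noise against the signal power before mixing. The sketch below uses the textbook power-based definition of SNR in decibels on plain Python lists. It is not necessarily the formula WavAugment applies internally, and the helper names `power` and `mix_at_snr` are ours:

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(signal, noise, snr_db):
    """Scale noise so that power(signal) / power(scaled noise) equals
    10 ** (snr_db / 10), then add it to the signal."""
    target_noise_power = power(signal) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    mixed = [s + gain * n for s, n in zip(signal, noise)]
    return mixed, gain


random.seed(0)
# A 440 Hz tone stands in for speech; the noise is uniform in [-1, 1].
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1.0, 1.0) for _ in range(16000)]

mixed, gain = mix_at_snr(speech, noise, snr_db=15)
achieved = 10 * math.log10(power(speech) / power([gain * n for n in noise]))
print(f"noise gain: {gain:.3f}, achieved SNR: {achieved:.1f} dB")
```

With this kind of explicit scaling, the result no longer depends on how loud the raw noise recording happens to be.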

In terms of sox effects, bandreject can be implemented as follows:

In [None]:
y = augment.EffectChain().sinc('-a', '120', '500-100').apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

Still comprehensible 🙂

## Randomization & Combining

In data augmentation, we typically want to randomize the applied augmentation and/or its strength. Every effect in WavAugment accepts a Callable in place of any of its parameters, which provides a way to randomize the applied effect. For instance, we can randomize the pitch shift as follows:

In [None]:
random_pitch_shift = lambda: np.random.randint(-400, +400)
# on every call, the pitch shift is drawn uniformly from [-400, 400) cents
random_pitch_shift_effect = augment.EffectChain().pitch("-q", random_pitch_shift).rate(sr)
# the -q flag enables faster, but lower-quality processing

Applying it twice, we'll get different results:

In [None]:
y = random_pitch_shift_effect.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [None]:
y = random_pitch_shift_effect.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)
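
One way to picture this mechanism: the chain stores the Callable itself and only calls it when the effect is applied, so every application sees a fresh value. The sketch below illustrates that idea in plain Python. `LazyEffect` and `resolve` are hypothetical names for illustration and do not reflect WavAugment's actual internals:

```python
import random


def resolve(param):
    """Return param() if it is callable, otherwise param itself."""
    return param() if callable(param) else param


class LazyEffect:
    """Hypothetical stand-in for an effect whose arguments may be callables."""

    def __init__(self, name, *params):
        self.name = name
        self.params = params

    def concrete_args(self):
        # Callables are re-evaluated on every call, so each application
        # of the effect can receive different random values.
        return [self.name] + [str(resolve(p)) for p in self.params]


random.seed(0)
random_shift = lambda: random.randint(-400, 399)
effect = LazyEffect("pitch", "-q", random_shift)
print(effect.concrete_args())  # fixed flag, freshly sampled shift
print(effect.concrete_args())  # sampled again on the second call
```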

If an effect has several parameters, we can replace all or some of them:

In [None]:
random_room_size = lambda: np.random.randint(0, 101)
random_reverb = augment.EffectChain().reverb(50, 50, random_room_size).channels(1)

y = random_reverb.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [None]:
y = random_reverb.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

We can easily stack augmentations:

In [None]:
combination = augment.EffectChain() \
  .pitch("-q", random_pitch_shift).rate(sr) \
  .reverb(50, 50, random_room_size).channels(1) \
  .additive_noise(noise_generator, snr=15) \
  .time_dropout(max_seconds=1.0)
y = combination.apply(x, src_info={'rate': sr}, target_info={'rate': sr})
ipd.Audio(y, rate=sr)

## Discussion & Limitations

*  Currently, all augmentations are non-batched and run on the CPU. Hence, it's a good idea to apply them inside a parallelized dataloader (see our [example](https://github.com/facebookresearch/WavAugment/blob/master/examples/python/librispeech_selfsupervised.py)).
* In some corner cases, the `pitch` augmentation within libsox might return `NaN`. It can be useful to handle this case explicitly (as we do [here](https://github.com/facebookresearch/WavAugment/blob/master/examples/python/librispeech_selfsupervised.py#L118)).
* To interpret what the sox-based effects do and which parameters they take, please consult the sox [documentation](http://sox.sourceforge.net/sox.html). All effects apart from additive noise, time dropout, and clipping are based on sox.
* The full list of 64 supported effects is:

In [None]:
augment.EffectChain.KNOWN_EFFECTS

{'additive_noise',
 'allpass',
 'band',
 'bandpass',
 'bandreject',
 'bass',
 'bend',
 'biquad',
 'channels',
 'chorus',
 'clip',
 'compand',
 'contrast',
 'dcshift',
 'deemph',
 'delay',
 'dither',
 'divide',
 'downsample',
 'earwax',
 'echo',
 'echos',
 'equalizer',
 'fade',
 'firfit',
 'flanger',
 'gain',
 'highpass',
 'hilbert',
 'input',
 'ladspa',
 'loudness',
 'lowpass',
 'mcompand',
 'noisered',
 'norm',
 'oops',
 'output',
 'overdrive',
 'pad',
 'phaser',
 'pitch',
 'rate',
 'remix',
 'repeat',
 'reverb',
 'reverse',
 'riaa',
 'silence',
 'sinc',
 'speed',
 'stat',
 'stats',
 'stretch',
 'swap',
 'synth',
 'tempo',
 'time_dropout',
 'treble',
 'tremolo',
 'trim',
 'upsample',
 'vad',
 'vol'}
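
Coming back to the `NaN` corner case mentioned in the list above, here is a minimal sketch of one possible guard: retry the (randomized) effect a few times and fall back to the unmodified input if the output still contains `NaN`. The linked example may handle this differently. The names `apply_with_nan_fallback` and `flaky_effect` are hypothetical, and plain lists are used instead of tensors to keep the sketch self-contained (with tensors, the check could be `torch.isnan(y).any()`):

```python
import math


def apply_with_nan_fallback(effect, samples, max_retries=3):
    """Apply effect; if the output contains NaN, retry up to max_retries
    times in total, then fall back to a copy of the input."""
    for _ in range(max_retries):
        out = effect(samples)
        if not any(math.isnan(v) for v in out):
            return out
    return list(samples)


# A toy effect that fails on its first call and succeeds afterwards.
calls = {"n": 0}


def flaky_effect(samples):
    calls["n"] += 1
    if calls["n"] == 1:
        return [float("nan")] * len(samples)
    return [0.5 * v for v in samples]


result = apply_with_nan_fallback(flaky_effect, [1.0, -1.0])
print(result, "after", calls["n"], "calls")
```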

Overall, we hope to extend and improve the library.