# Audio classification the from-scratch way

Thanks to the SF Study Group practitioners: @aamir7117, @marii, @simonjhb, @ste, @ThomM, @zachcaceres.

We're going to demonstrate the technique of classifying audio samples by first converting the audio into spectrograms, then treating the spectrograms as images. Once the audio has been converted to spectrograms, the workflow is just the same as for imagenette or any other image classification task.

What do we need to do?
* Download the data
* Load the data 
* Transform the data into spectrograms
* Load the audio data into a databunch such that we can use our previously-defined `learner` object

Still to come - data augmentations for audio, 1D convolutional models, RNNs with audio… and more, with your contribution :)

### Setup & imports

In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

We rely heavily on [torchaudio](https://github.com/pytorch/audio) - which you'll have to compile to install.

In [None]:
#export
from exp.nb_12c import *

import torchaudio
from torchaudio import transforms

In [None]:
#export
AUDIO_EXTS = {str.lower(k) for k,v in mimetypes.types_map.items() if v.startswith('audio/')}

### Download

This should be one line; it's only so complicated because the target .tgz file doesn't extract itself to its own directory.

In [None]:
dsid = "ST-AEDS-20180100_1-OS"
data_url = f'http://www.openslr.org/resources/45/{dsid}' # actual URL has .tgz extension but untar_data doesn't like that
path = Path.home() / Path(f".fastai/data/{dsid}/")
datasets.untar_data(data_url, dest=path)
path

## Loading into an AudioList

Getting a file list the `08_data_block` way.

The "manual" way using `get_files`…

In [None]:
audios = get_files(path, extensions=AUDIO_EXTS)
print(f"Found {len(audios)} audio files")
audios[:5]
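
To make the file-listing step concrete, here is a minimal standard-library sketch of the same idea: build a set of audio extensions from `mimetypes`, then keep only files whose suffix is in that set. `toy_get_files` is a hypothetical stand-in written for this illustration; it is not the `get_files` from `08_data_block`, and the temporary files are made up.

```python
import mimetypes
import tempfile
from pathlib import Path

# Same idea as AUDIO_EXTS: every extension whose MIME type starts with "audio/".
audio_exts = {ext.lower() for ext, kind in mimetypes.types_map.items()
              if kind.startswith("audio/")}

def toy_get_files(root, extensions):
    # Hypothetical helper: recurse under `root`, keep files with a matching suffix.
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() in extensions)

with tempfile.TemporaryDirectory() as tmp:
    # Made-up files: two audio clips (one with an upper-case suffix) and a text file.
    for name in ("a.wav", "b.WAV", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in toy_get_files(tmp, audio_exts)]

print(found)
```

Lower-casing the suffix before the lookup is what lets `b.WAV` through while `notes.txt` is dropped.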

…But that's not very exciting. Let's make an `AudioList`, so we can use transforms, and define how to `get` an Audio.

### AudioList

In [None]:
#export
class AudioList(ItemList):
    @classmethod
    def from_files(cls, path, extensions=None, recurse=True, include=None, **kwargs):
        if extensions is None: extensions = AUDIO_EXTS
        return cls(get_files(path, extensions, recurse=recurse, include=include), path, **kwargs)
    
    def get(self, fn): 
        sig, sr = torchaudio.load(fn)
        assert sig.size(0) == 1, "Non-mono audio detected, mono only supported for now."
        return (sig, sr)

In [None]:
al = AudioList.from_files(path); al

It looks like this is full of file paths, but that's just the `repr` talking. Actually accessing an item from the list calls the `get` method and returns a `(Tensor, Int)` tuple representing the signal & sample rate.

In [None]:
al[0]

## Splitting into train/validation

Our data is all in one folder with no dedicated validation set, so let's just split it at random.

In [None]:
sd = SplitData.split_by_func(al, partial(random_splitter, p_valid=0.2))

In [None]:
sd
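
As a rough picture of what a random split does, here is a minimal sketch, assuming each item is sent to the validation set independently with probability `p_valid`. `toy_random_split`, the seed, and the file names are hypothetical; this is not the `random_splitter` imported from the earlier notebooks.

```python
import random

def toy_random_split(items, p_valid=0.2, seed=42):
    # Hypothetical helper: draw one random number per item and route it accordingly.
    rng = random.Random(seed)
    train, valid = [], []
    for item in items:
        (valid if rng.random() < p_valid else train).append(item)
    return train, valid

# Made-up file names, just to have something to split.
files = [f"clip_{i:04d}.wav" for i in range(1000)]
train, valid = toy_random_split(files)
print(len(train), len(valid))
```

Because every item is drawn independently, the split is only approximately 80/20 rather than exact.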

## Labeling

Our labels are encoded in our filenames. For example, `m0003_us_m0003_00032.wav` has the label `m0003`. Let's make a regex labeler, then use it.

In [None]:
#export
def re_labeler(fn, pat): return re.findall(pat, str(fn))[0]

In [None]:
label_pat = r'/([mf]\d+)_'
speaker_labeler = partial(re_labeler, pat=label_pat)
ll = label_by_func(sd, speaker_labeler, proc_y=CategoryProcessor())

In [None]:
ll
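
To see what the pattern captures, here is a small standalone check using only `re`. The directory part of the path is a hypothetical example; the filename follows the pattern described above.

```python
import re

label_pat = r'/([mf]\d+)_'

# Hypothetical full path; only the filename format comes from the dataset.
fn = "/data/ST-AEDS-20180100_1-OS/m0003_us_m0003_00032.wav"

# findall returns every capture of the group; the speaker id is the first one.
label = re.findall(label_pat, fn)[0]
print(label)
```

The leading `/` in the pattern anchors the match to the start of the filename, so the second `m0003` later in the name is not captured. It also means the pattern assumes forward-slash path separators.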

## Transforms: audio clipping & conversion to spectrograms

The PyTorch dataloader needs all tensors in a batch to be the same size, but our input audio files have different lengths, so we need to pad or trim them. Also, recall that we're not going to send the model the _audio_ directly; we're going to convert it to spectrograms first. We can treat these steps as transforms. In particular, the `_order` property makes it simple to control the sequence in which they run.
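
As a rough illustration of why `_order` helps, here is a minimal sketch, assuming transforms are applied in ascending `_order` regardless of how the list was written. `apply_in_order` and the two toy transform classes are hypothetical names invented for this sketch, not part of the notebook's library.

```python
class AddOne:
    _order = 10
    def __call__(self, x): return x + 1

class Double:
    _order = 20
    def __call__(self, x): return x * 2

def apply_in_order(x, tfms):
    # Sort by `_order` (defaulting to 0), then apply each transform in turn.
    for tfm in sorted(tfms, key=lambda t: getattr(t, "_order", 0)):
        x = tfm(x)
    return x

# Listed "backwards" on purpose: AddOne still runs first because of its lower _order.
result = apply_in_order(3, [Double(), AddOne()])
print(result)
```

With this scheme, each transform declares where it belongs in the pipeline, so the caller can list them in any order.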

### ToCuda

The other transforms both use all-tensor ops, so moving the signal to the GPU first should help. Let's try it out.

In [None]:
#export
class ToCuda(Transform):
    _order=10
    def __call__(self, ad):
        sig,sr=ad
        return (sig.cuda(), sr)

In [None]:
ToCuda()(ll.train[0][0])

### PadOrTrim

`torchaudio` already has a transform for this; all we're doing is taking an argument in milliseconds rather than frames.

In [None]:
#export
class PadOrTrim(Transform):
    _order=11
    def __init__(self,msecs):
        self.msecs = msecs
        
    def __call__(self, ad): 
        sig, sr = ad
        mx = sr * self.msecs // 1000  # multiply first so rates like 22050 Hz aren't truncated
        return (transforms.PadTrim(mx)(sig), sr)
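
The milliseconds-to-frames conversion and the pad-or-trim step can be sketched without `torchaudio`, using plain Python lists. `ms_to_frames` and `pad_or_trim` are hypothetical helpers for illustration, and the 16 kHz sample rate is an assumption made for the example.

```python
def ms_to_frames(sr, msecs):
    # Multiply before dividing so a rate such as 22050 Hz keeps its full precision.
    return sr * msecs // 1000

def pad_or_trim(samples, n_frames, fill=0.0):
    # Trim clips that are too long; right-pad clips that are too short.
    if len(samples) >= n_frames:
        return samples[:n_frames]
    return samples + [fill] * (n_frames - len(samples))

sr = 16000                      # assumed sample rate for this example
n = ms_to_frames(sr, 3000)      # 3 seconds expressed in frames
short_clip = [0.1] * 20000      # shorter than 3 s, so it gets padded
long_clip = [0.1] * 60000       # longer than 3 s, so it gets trimmed

print(n, len(pad_or_trim(short_clip, n)), len(pad_or_trim(long_clip, n)))
```

Either way, every clip comes out with exactly `n` frames, which is what the dataloader needs in order to stack them into a batch.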

Small helper to show some audio.

In [None]:
#export
from IPython.display import Audio
def show_audio(ad):
    sig,sr=ad
    return Audio(data=sig, rate=sr)

*Note - this won't work if you've already run the notebook all the way through, because `ll` now contains Tensors representing Spectrograms, not `(Signal, SampleRate)` tuples.*

In [None]:
show_audio(ll.train[0][0])

In [None]:
pt = PadOrTrim(3000) ## duration in milliseconds
show_audio(pt(ll.train[0][0]))

### Spectrogram

Luckily, `torchaudio` takes care of the calculation & conversion to Spectrograms for us.

Instead of clipping our audio, we could also modify our spectrogram transform to ensure all the final spectrograms have the same shape. That's a little more complicated, though: the size of a spectrogram is a function of the length of the clip, so we'd have to calculate an `n_mels` and/or `ws` (window_size) param independently per clip.
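
To see how the spectrogram's size depends on clip length, here is a deliberately naive sketch: a Hann-windowed DFT computed frame by frame with the standard library. `toy_spectrogram` is a hypothetical helper and is not how `torchaudio` computes its spectrograms (it has no padding, no mel filterbank, and no dB conversion); the 1 kHz test tone and 8 kHz sample rate are made up for the example.

```python
import cmath
import math

def toy_spectrogram(samples, n_fft=64, hop=32):
    """Return a list of frames, each holding n_fft//2 + 1 power values."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    # Slide a window of n_fft samples along the signal, `hop` samples at a time.
    for start in range(0, len(samples) - n_fft + 1, hop):
        chunk = [s * w for s, w in zip(samples[start:start + n_fft], window)]
        frame = []
        for k in range(n_fft // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n, x in enumerate(chunk))
            frame.append(abs(acc) ** 2)
        frames.append(frame)
    return frames

sr = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(sr // 4)]  # 0.25 s at 1 kHz

short_spec = toy_spectrogram(tone[:512])
long_spec = toy_spectrogram(tone)
peak_bin = max(range(len(long_spec[0])), key=lambda k: long_spec[0][k])

print(len(short_spec), len(long_spec), len(long_spec[0]), peak_bin)
```

The number of frequency bins is fixed by `n_fft`, but the number of frames grows with the clip length. That is why clips of different durations produce spectrograms of different widths unless they are padded or trimmed first.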

In [None]:
#export
class Spectrogrammer(Transform):
    _order=90
    def __init__(self, to_mel=True, to_db=True, n_fft=400, ws=None, hop=None, 
                 f_min=0.0, f_max=None, pad=0, n_mels=128, top_db=None, normalize=False):
        self.to_mel, self.to_db, self.n_fft, self.ws, self.hop, self.f_min, self.f_max, \
        self.pad, self.n_mels, self.top_db, self.normalize = to_mel, to_db, n_fft, \
        ws, hop, f_min, f_max, pad, n_mels, top_db, normalize

    def __call__(self, ad):
        sig,sr = ad
        if self.to_mel:
            spec = transforms.MelSpectrogram(sr, self.n_fft, self.ws, self.hop, self.f_min, 
                                             self.f_max, self.pad, self.n_mels)(sig)
        else: 
            spec = transforms.Spectrogram(self.n_fft, self.ws, self.hop, self.pad, 
                                          normalize=self.normalize)(sig)
        if self.to_db:
            spec = transforms.SpectrogramToDB(top_db=self.top_db)(spec)
        spec = spec.permute(0,2,1)
        return spec

In [None]:
speccer = Spectrogrammer(to_db=True, n_fft=1024, n_mels=64, top_db=80)

Small helper to show a spectrogram.

In [None]:
#export
def show_spectro(img, ax=None, figsize=(6,6), with_shape=True):
    if hasattr(img,"device") and str(img.device).startswith("cuda"): img = img.cpu()
    if ax is None: _,ax = plt.subplots(1, 1, figsize=figsize)
    ax.imshow(img if (img.shape[0]==3) else img.squeeze(0))
    if with_shape: display(f'Tensor shape={img.shape}')

*Note - this won't work if you've already run the notebook all the way through, because `ll` now contains Tensors representing Spectrograms, not `(Signal, SampleRate)` tuples.*

In [None]:
show_spectro(speccer(ll.train[0][0]))

### Using the transforms

Now let's create the transforms with the params we want, and rebuild our label lists using them. 

Note that now the items in the final `LabelList` won't be tuples anymore, they'll just be tensors. This is convenient for actually using the data, but it means you can't really go back and listen to your audio anymore. We can probably find a way around this, but let's press on for now.

In [None]:
pad_3sec = PadOrTrim(3000)
speccer = Spectrogrammer(n_fft=1024, n_mels=64, top_db=80)

tfms = [ToCuda(), pad_3sec, speccer]

al = AudioList.from_files(path, tfms=tfms)
sd = SplitData.split_by_func(al, partial(random_splitter, p_valid=0.2))
ll = label_by_func(sd, speaker_labeler, proc_y=CategoryProcessor())

In [None]:
show_spectro(ll.train[4][0])

## Databunch

Now we've got our beautifully transformed tensors, let's add them into a databunch, so we can feed a model easily.

We could use the `get_dls` func we defined in `03_minibatch_training`, but let's use the `to_databunch` func we defined in `08_data_block` instead; it's much nicer.

In [None]:
bs=64

c_in = ll.train[0][0].shape[0]
c_out = len(uniqueify(ll.train.y))

In [None]:
data = ll.to_databunch(bs,c_in=c_in,c_out=c_out)

Check the dataloader's batching functionality.

In [None]:
x,y = next(iter(data.train_dl))

In [None]:
#export
def show_batch(x, c=4, r=None, figsize=None, shower=show_image):
    n = len(x)
    if r is None: r = int(math.ceil(n/c))
    if figsize is None: figsize=(c*3,r*3)
    fig,axes = plt.subplots(r,c, figsize=figsize)
    for xi,ax in zip(x,axes.flat): shower(xi, ax)

In [None]:
show_spec_batch = partial(show_batch, c=4, r=2, figsize=None, 
                          shower=partial(show_spectro, with_shape=False))

In [None]:
show_spec_batch(x)

Looking good.

## Training

Go for gold! As a proof of concept, let's use the *pièce de résistance* learner builder with the hyperparameters from Lesson 11 `11_train_imagenette`.

In [None]:
opt_func = adam_opt(mom=0.9, mom_sqr=0.99, eps=1e-6, wd=1e-2)
loss_func = LabelSmoothingCrossEntropy()
lr = 1e-2
pct_start = 0.5
phases = create_phases(pct_start)
sched_lr  = combine_scheds(phases, cos_1cycle_anneal(lr/10., lr, lr/1e5))
sched_mom = combine_scheds(phases, cos_1cycle_anneal(0.95,0.85, 0.95))
cbscheds = [ParamScheduler('lr', sched_lr), 
            ParamScheduler('mom', sched_mom)]
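
For intuition about the shape of the learning-rate schedule, here is a minimal sketch of a two-phase cosine schedule, assuming the first `pct_start` of training warms up from the start value to the peak and the remainder anneals down to the end value. `cos_interp` and `toy_one_cycle` are hypothetical stand-ins written for this illustration; they are not the `combine_scheds` and `cos_1cycle_anneal` functions used in the cell above.

```python
import math

def cos_interp(start, end, pos):
    # Cosine interpolation: returns `start` at pos=0 and `end` at pos=1.
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def toy_one_cycle(start, high, end, pct_start=0.5):
    """Return f(pos) for pos in [0, 1]: warm up to `high`, then anneal to `end`."""
    def _inner(pos):
        if pos < pct_start:
            return cos_interp(start, high, pos / pct_start)
        return cos_interp(high, end, (pos - pct_start) / (1 - pct_start))
    return _inner

lr = 1e-2
toy_sched_lr = toy_one_cycle(lr / 10, lr, lr / 1e5)

# Sample the schedule at a few points in training (0 = first batch, 1 = last batch).
for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{pos:.2f} -> {toy_sched_lr(pos):.7f}")
```

In this sketch the learning rate starts at `lr/10`, peaks at `lr` halfway through, and ends at `lr/1e5`. The momentum schedule has the same shape, but mirrored: it dips from 0.95 to 0.85 and comes back.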

In [None]:
learn = cnn_learner(xresnet34, data, loss_func, opt_func)

In [None]:
learn.fit(5, cbs=cbscheds)

## Demo - all at once

This is all the code it takes to do it end-to-end (not counting the `#export` cells above).

In [None]:
# dsid = "ST-AEDS-20180100_1-OS"
# data_url = f'http://www.openslr.org/resources/45/{dsid}' # actual URL has .tgz extension but untar_data doesn't like that
# path = Path.home() / Path(f".fastai/data/{dsid}/")
# datasets.untar_data(data_url, dest=path)

# pad_3sec = PadOrTrim(3000)
# speccer = Spectrogrammer(n_fft=1024, n_mels=64, top_db=80)

# tfms = [pad_3sec, speccer]

# al = AudioList.from_files(path, tfms=tfms)

# sd = SplitData.split_by_func(al, partial(random_splitter, p_valid=0.2))

# label_pat = r'/([mf]\d+)_'
# speaker_labeler = partial(re_labeler, pat=label_pat)
# ll = label_by_func(sd, speaker_labeler, proc_y=CategoryProcessor())

# bs=64
# c_in = ll.train[0][0].shape[0]
# c_out = len(uniqueify(ll.train.y))

# data = ll.to_databunch(bs,c_in=c_in,c_out=c_out)

# opt_func = adam_opt(mom=0.9, mom_sqr=0.99, eps=1e-6, wd=1e-2)
# loss_func = LabelSmoothingCrossEntropy()
# lr = 1e-2
# pct_start = 0.5
# phases = create_phases(pct_start)
# sched_lr  = combine_scheds(phases, cos_1cycle_anneal(lr/10., lr, lr/1e5))
# sched_mom = combine_scheds(phases, cos_1cycle_anneal(0.95,0.85, 0.95))
# cbscheds = [ParamScheduler('lr', sched_lr), 
#             ParamScheduler('mom', sched_mom)]

# learn = cnn_learner(xresnet34, data, loss_func, opt_func)
# learn.fit(5, cbs=cbscheds)

## Fin

In [None]:
nb_auto_export()