In [1]:
#hide
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# How to normalize spectrograms
> Scaling spectrograms as inputs to neural networks.  

- toc: true 
- badges: true
- comments: true
- categories: [normalizing_spectrums]
- image: images/violin_spec.png

# Introduction

Audio is naturally a 1D signal.
It can be transformed into 2D in several ways.
The most common is the short-time Fourier transform (STFT).
The STFT turns audio into a spectrogram: a 2D representation in time and frequency.
Since the spectrogram is 2D, we can think of it as an image.
That means we can leverage all the great work from deep learning image classification.
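To make the 1D-to-2D step concrete, here is a minimal NumPy sketch of an STFT. The helper `simple_stft`, the 16 kHz sampling rate, and the FFT and hop sizes are all assumptions chosen for illustration; this is not how fastaudio or torchaudio compute their spectrograms.

```python
import numpy as np

def simple_stft(signal, n_fft=512, hop=256):
    """Magnitude spectrogram from framed, Hann-windowed FFTs (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, time_frames)

sr = 16000                                  # assumed sampling rate in Hz
t = np.arange(sr) / sr                      # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)         # 1 kHz test tone, shape (16000,)
spec = simple_stft(tone)                    # 2D array: frequency x time
peak_bin = int(spec.mean(axis=1).argmax())  # strongest frequency bin
```

The 1D tone becomes a 2D array whose energy sits in the row for 1 kHz, which is the "image" the rest of the post works with.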


It is hard to overstate the success of deep neural networks for image classification.
This was previously a challenging task dominated by handcrafted features.
Now the features are learned automatically from labeled data.
This was made possible by improvements in datasets, algorithms, and compute.


But there is one tricky issue with using spectrogram inputs.
Data must be properly normalized when training a neural network.
Otherwise learning is difficult for the network: training could take a very long time or even diverge.
A spectrogram, however, is fundamentally different from a natural image.
So how can we properly normalize spectrograms?


# Background on normalization

It is well known how to scale images: they are stored in the 0 to 255 range and then converted to the range 0 to 1.
We then find the statistics that approximately center the data and give it unit variance.
Spectrograms live in a completely different range.
They are usually analyzed in the log domain, where the possible range is -inf to +inf.
In practice the values are more constrained, but the range is still much larger than for images (and includes negative values).
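A small sketch of the two ranges, using NumPy. The pixel values are random and the power values are made-up examples, picked only to show how far apart the scales are.

```python
import numpy as np

rng = np.random.default_rng(0)

# image pixels: stored as 0..255, rescaled to 0..1
pixels = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
scaled = pixels.astype(np.float32) / 255.0

# spectrogram power: hypothetical values spanning many orders of magnitude
power = np.array([1e-10, 1e-3, 1.0, 1e4])
log_power = 10 * np.log10(power)   # decibels, roughly [-100, -30, 0, 40]
```

The image ends up in a fixed, non-negative interval, while the log-power values are unbounded in principle and are often negative.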

Several details pop up in how to normalize audio:

- How do we normalize the waveform?
- How do we normalize the spectrogram?
- How do we compute the normalization statistics?
- How do we deal with rapidly changing audio?

We will start with normalizing waveforms and then move to spectrograms.

# Sample audio dataset


To make things practical, we will apply normalization techniques to the ongoing ESC-50 challenge hosted by the fastai audio library, [fastaudio](https://github.com/fastaudio/fastaudio).
The challenge uses the ESC-50 dataset for sound classification.
ESC-50 has a wide range of sounds, which will give us a good feel for how varied spectrograms can be.


The fastaudio library is rapidly changing and improving, so the instructions might change, but we will aim to keep this post updated.
Many lines below are based on the [baseline results notebook](https://github.com/fastaudio/Audio-Competition/blob/master/ESC-50-baseline-1Fold.ipynb) for convenience. 


## Importing `fastai` and `fastaudio` modules

In [None]:
from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *

## Downloading the ESC-50 dataset

In [None]:
# already included in fastaudio, can download with fastai's `untar_data`
path = untar_data(URLs.ESC50)

The ESC-50 data is inside of an `audio` folder. 
We can look inside to find many `.wav` files.

In [None]:
wavs = (path/"audio").ls()
wavs

We use the first waveform as an example for normalization. We can load an audio file using the `create` function of the `AudioTensor` class in `fastaudio`. This class wraps a `torch.Tensor` with added syntactic sugar. 

In [None]:
# create an AudioTensor from a file path
sample = AudioTensor.create(wavs[0])

Thanks to functionality in the `AudioTensor` class we can easily plot and even listen to our sample!

In [None]:
sample.show()

In [None]:
sample.hear()

Let's start by normalizing this waveform.

# Normalizing waveforms

The first step is normalizing the audio waveform.
We give it a mean of zero and unit variance in the usual way: 

$$\text{audio}_{\text{norm}} = \frac{\text{audio} - \text{mean}(\text{audio})}{\text{std}(\text{audio})} $$

In [None]:
# normalize the waveform
norm_sample = (sample - sample.mean()) / sample.std()

Let's check that the mean is roughly zero and the variance is roughly one.

In [None]:
# checking if normalization worked
norm_sample.mean(),norm_sample.var()

Success! Let's wrap it back in an `AudioTensor` for convenience. The sampling rate did not change, so we pass in the sampling rate from the unnormalized waveform.

In [None]:
norm_sample = AudioTensor(norm_sample, sr=sample.sr)
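One edge case the formula does not cover is a silent clip, where the standard deviation is zero. The sketch below shows one possible guard, written in NumPy on synthetic signals. The helper name `normalize_waveform` and the `eps` value are assumptions for illustration, not part of fastaudio.

```python
import numpy as np

def normalize_waveform(audio, eps=1e-8):
    """Zero-mean, unit-variance scaling; eps avoids dividing by zero on silence."""
    return (audio - audio.mean()) / (audio.std() + eps)

rng = np.random.default_rng(0)
noisy = 0.1 * rng.standard_normal(16000) + 0.5   # synthetic clip with a DC offset
silent = np.zeros(16000)                         # all-zero clip

norm_noisy = normalize_waveform(noisy)           # mean near 0, std near 1
norm_silent = normalize_waveform(silent)         # stays all zeros instead of NaN
```

Without the `eps` term, the silent clip would turn into NaNs and could poison a whole mini-batch.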

Now that we normalized the audio we can convert it to a spectrogram.

# Extracting spectrograms from audio

We can now extract a spectrogram from the normalized audio.
The fastaudio library has a helpful way of converting `AudioTensor`s into Spectrograms by wrapping some parts of [`torchaudio`](https://pytorch.org/audio/).

In [None]:
# create a fastai Transform that converts audio into spectrograms
cfg = AudioConfig.BasicSpectrogram()
audio2spec = AudioToSpec.from_cfg(cfg)

The details of the spectrogram are not directly relevant here.
But you can take a look at the [spectrogram source code](https://pytorch.org/audio/_modules/torchaudio/functional.html#spectrogram) to see it basically does pre and post processing around a [torch.stft](https://pytorch.org/docs/stable/generated/torch.stft.html) call.
We can now transform our audio into a spectrogram and show it.

In [None]:
# extract and view our spectrogram
spec = audio2spec(norm_sample)
spec

This is a good time to compare the shapes of the audio vs. spectrogram to see how it went from one to two dimensions.

In [None]:
f'Audio shape: {norm_sample.shape} | Spectrogram shape: {spec.shape}'
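The spectrogram shape can also be predicted from the STFT settings. The sketch below assumes a one-sided, centered STFT (the default behavior of `torch.stft` with `center=True`) and a 5-second clip at 44.1 kHz as in ESC-50. The `n_fft` and `hop_length` values are placeholders and may not match what `AudioConfig.BasicSpectrogram` uses.

```python
def expected_spec_shape(n_samples, n_fft, hop_length):
    """Shape (freq_bins, time_frames) of a one-sided, centered STFT."""
    freq_bins = n_fft // 2 + 1                 # non-negative frequencies only
    time_frames = 1 + n_samples // hop_length  # one frame per hop, plus the first
    return freq_bins, time_frames

n_samples = 5 * 44100                          # 5-second clip at 44.1 kHz
shape = expected_spec_shape(n_samples, n_fft=1024, hop_length=512)
```

A mel spectrogram replaces the frequency-bin count with the number of mel filters, so only the time dimension carries over in that case.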

# How do we normalize spectrograms?

As stated in the introduction, a spectrogram is fundamentally different from an image.


In an image, both dimensions are in the spatial domain and have the same units.
For a color image, we have three channels (RGB) and we normalize each one.
If the image has a single channel (grayscale) then we normalize it instead.
Given that both dimensions share the same scale and domain, and given the general layout of natural images, it makes sense to normalize each channel with a single, global value.

In a spectrogram, one dimension represents time and the other represents frequency.
These are different quantities with different scales and sizes.
The frequency dimension is given by our choice of FFT size, which sets our spectral resolution.
The time dimension is given by the length of our signal, the FFT size, and the window overlap, which together set our temporal resolution.

However, the spectrogram also introduces the notion of a different type of channel.
This makes "channel" an overloaded term for our purposes but it is still a crucial piece of the puzzle. 
The spectrogram transform can be interpreted as a "channelizer".
That is a fancy way to say that it takes the continuous frequency spectrum of our signal and chops it up into discrete bins, or channels. For example, consider a signal sampled at 16 kHz (a common rate for audio) where we take an STFT of size 512. The full STFT has 512 channels, and for real-valued audio only the 257 non-negative-frequency channels are kept because the rest mirror them. Each channel has a "bandwidth" of $$16 \ \text{kHz} \ \ / \ \ 512 \ \text{bins} = 31.25 \ \text{Hz per bin}$$
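The bin arithmetic can be checked with NumPy. This sketch uses the 16 kHz and 512-point values from the example. `np.fft.rfftfreq` returns the center frequencies of the bins that a one-sided FFT of a real signal keeps.

```python
import numpy as np

sr = 16000       # sampling rate in Hz
n_fft = 512      # STFT size

bin_width = sr / n_fft                        # Hz covered by each channel
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)    # bin centers from 0 Hz to sr/2
```

Here `bin_width` is 31.25 Hz and `freqs` holds 257 evenly spaced centers ending at the 8 kHz Nyquist frequency.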

Even though these spectrum channels are different from the channels in an image, it raises the question: should we (or can we?) normalize an entire spectrum "image" with a single, global value? Or do we need to normalize each channel, as is done with images?  

There is no clear answer here, and your approach will likely depend both on the specifics of your problem and where your system will be deployed.
For example, if you are building a system that will be deployed in an environment similar to the training one, then it might make more sense to normalize by channels.
Your channel-based normalization statistics will follow the average noise floor and activity of the training data.
This motivation holds if you expect roughly the same patterns and distributions of activity once the system is deployed.
However, it will be critical to monitor the deployed environment and update the statistics as needed, or else you will slowly drift out of domain.

If your system will instead be used in a completely different environment, of which you have no knowledge, then the global statistics could be a better fit. While not as technically sound, your model won't be as surprised by radically new activity across different channels. 

Lastly, there is the issue of transfer learning. In transfer learning it is best practice to normalize the new dataset with the statistics from the old dataset. In most cases that means normalizing with ImageNet statistics.
So if you are doing transfer learning, the easiest approach will be to use the original stats. 
If your dataset is large enough that you are training from scratch, then the discussion above applies.

# Global Spectrogram Normalization

We will start by using a single, global value to normalize the spectrograms.
This is the same way images are normalized.
We will treat the spectrogram as a single-channel image.
We need to find a single mean and standard deviation to apply to each image.

We get these from the training dataset.
This means stepping through our mini-batches and finding the mean and standard deviation for each batch.
Then, we accumulate and average them over our training samples to get a "global" statistic. 
First, we need a way to accumulate these statistics over mini-batches. We will borrow from the very helpful guide [here](http://notmatthancock.github.io/2017/03/23/simple-batch-stat-updates.html).
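Written out, the update combines the statistics of the $m$ observations seen so far with those of a new batch of $n$ observations. The symbols are chosen here for illustration: $\mu_m, \sigma_m$ are the running mean and standard deviation, and $\mu_n, \sigma_n$ are those of the new batch.

$$\mu = \frac{m}{m+n}\,\mu_m + \frac{n}{m+n}\,\mu_n$$

$$\sigma^2 = \frac{m}{m+n}\,\sigma_m^2 + \frac{n}{m+n}\,\sigma_n^2 + \frac{m\,n}{(m+n)^2}\,(\mu_m - \mu_n)^2$$

The last term accounts for the two groups having different means. The expressions are exact when $\sigma^2$ is the population variance and $m$, $n$ count the elements that went into each statistic.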

One small detail: if your training dataset is large enough, you do not need to iterate through the entire thing.
It is often enough to sample only 10 to 20% of the samples for accurate statistics.
Since ESC-50 is small enough, we get statistics from the whole set.
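For larger datasets, one way to subsample is to draw a random subset of item indices and compute statistics only on those. The sketch below uses NumPy with synthetic stand-in values. The training-set size, the 20% fraction, and the normal-distributed values are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 1600     # hypothetical training-set size
fraction = 0.2     # use 20% of the items for statistics

# pick a random subset of item indices without repeats
subset_idx = rng.choice(n_train, size=int(fraction * n_train), replace=False)

# synthetic per-item values, only to compare subset and full statistics
values = rng.normal(loc=-40.0, scale=15.0, size=n_train)
full_mean, full_std = values.mean(), values.std()
sub_mean, sub_std = values[subset_idx].mean(), values[subset_idx].std()
```

On data like this the subset statistics land close to the full ones, although how close depends on how varied the real dataset is.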

The class below tracks our global mean and standard deviation across mini-batches.

In [None]:
class StatsRecorder:
    def __init__(self, red_dims=(0,2,3)):
        """Accumulates normalization statistics across mini-batches.
        ref: http://notmatthancock.github.io/2017/03/23/simple-batch-stat-updates.html
        """
        self.red_dims = red_dims # which mini-batch dimensions to average over
        self.nobservations = 0   # running number of seen observations

    def update(self, data):
        """
        data: torch.Tensor mini-batch, shape (batch, channels, height, width)
        """
        # initialize stats and dimensions on first batch
        if self.nobservations == 0:
            self.mean = data.mean(dim=self.red_dims, keepdim=True)
            self.std  = data.std (dim=self.red_dims,keepdim=True)
            self.nobservations = data.shape[0]
            self.ndimensions   = data.shape[1]
        else:
            if data.shape[1] != self.ndimensions:
                raise ValueError("Data dims don't match prev observations.")
            
            # find mean of new mini batch
            newmean = data.mean(dim=self.red_dims, keepdim=True)
            newstd  = data.std(dim=self.red_dims, keepdim=True)
            
            # update number of observations
            m = self.nobservations * 1.0
            n = data.shape[0]

            # update running statistics
            tmp = self.mean
            self.mean = m/(m+n)*tmp + n/(m+n)*newmean
            self.std  = m/(m+n)*self.std**2 + n/(m+n)*newstd**2 +\
                        m*n/(m+n)**2 * (tmp - newmean)**2
            self.std  = torch.sqrt(self.std)
                                 
            # update total number of seen samples
            self.nobservations += n
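As a sanity check on the running-update formulas, the sketch below merges batch statistics one batch at a time and compares the result with statistics computed on all the data at once. It is a standalone NumPy version with a hypothetical helper `merge_stats`, not the class above. It uses population variance and element counts, for which the merge is exact.

```python
import numpy as np

def merge_stats(m, mean_a, var_a, n, mean_b, var_b):
    """Combine the mean and variance of two groups of sizes m and n."""
    total = m + n
    mean = m / total * mean_a + n / total * mean_b
    var = (m / total * var_a + n / total * var_b
           + m * n / total**2 * (mean_a - mean_b) ** 2)
    return total, mean, var

rng = np.random.default_rng(0)
data = rng.normal(loc=-40.0, scale=15.0, size=1000)   # synthetic stand-in values
batches = np.array_split(data, 7)                     # uneven batch sizes

# start from the first batch, then fold in the rest one at a time
count, mean, var = len(batches[0]), batches[0].mean(), batches[0].var()
for batch in batches[1:]:
    count, mean, var = merge_stats(count, mean, var,
                                   len(batch), batch.mean(), batch.var())
```

After the loop, `mean` and `var` match `data.mean()` and `data.var()` up to floating-point error. With an unbiased standard deviation or with batch counts in place of element counts, the result is a close approximation instead of an exact match.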

By default, it reduces over the batch, height, and width dimensions, which leaves one statistic per image channel (grayscale or RGB). 
The `red_dims` default might look familiar from other Computer Vision normalization code. 
Later on we will normalize by spectrogram channels by passing a different `red_dims`. 
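To see what the reduction dimensions do, here is a NumPy sketch on a random batch laid out as (batch, channel, frequency, time). The batch size and spectrogram dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 1, 128, 50))   # (batch, channel, freq, time)

# reduce over batch, freq, and time: one value per image channel
global_mean = batch.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, 1, 1, 1)

# reduce over batch, channel, and time: one value per frequency bin
channel_mean = batch.mean(axis=(0, 1, 3), keepdims=True)   # shape (1, 1, 128, 1)
```

Keeping the reduced dimensions as size one lets the statistics broadcast against a batch when they are applied later.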

## Get normalization stats from training dataset

Now we have to iterate through our training dataset and find the global statistics. 
The setup follows the fastaudio ESC-50 baseline.

In [None]:
df = pd.read_csv(path/"meta"/"esc50.csv")
df.head()

def CrossValidationSplitter(col='fold', fold=1):
    "Split `items` (supposed to be a dataframe) by fold in `col`"
    def _inner(o):
        assert isinstance(o, pd.DataFrame), "ColSplitter only works when your items are a pandas DataFrame"
        col_values = o.iloc[:,col] if isinstance(col, int) else o[col]
        valid_idx = (col_values == fold).values.astype('bool')
        return IndexSplitter(mask2idxs(valid_idx))(o)
    return _inner

auds = DataBlock(blocks=(AudioBlock, CategoryBlock),  
                 get_x=ColReader("filename", pref=path/"audio"), 
                 splitter=CrossValidationSplitter(fold=1),
                 batch_tfms = [audio2spec],  # the AudioToSpec transform created earlier
                 get_y=ColReader("category"))
dbunch = auds.dataloaders(df, bs=64)
dbunch.show_batch()

Next we create our recorder and find the needed normalization statistics.

In [None]:
global_stats = StatsRecorder()
for idx,o in enumerate(iter(dbunch.train)):
    x,y = o
    global_stats.update(x)
global_mean,global_std = global_stats.mean,global_stats.std

We can check that they have the same shape as typical grayscale normalization stats. With a single channel, we expect shape: `[1,1,1,1]`.

In [None]:
global_mean,global_mean.shape

In [None]:
global_std,global_std.shape

Next, let's repeat this with a new `red_dims` argument to find normalization stats for each spectrogram channel. The new `red_dims` tells the recorder to average over everything except the frequency axis.

In [None]:
channel_stats = StatsRecorder(red_dims=(0,1,3))
for idx,o in enumerate(iter(dbunch.train)):
    x,y = o
    channel_stats.update(x)
channel_mean,channel_std = channel_stats.mean,channel_stats.std

# Making Normalization transforms

First we need a transform to normalize the audio, as shown in the first section.

In [None]:
class AudioNormalize(Transform):
    "Normalizes a single audio tensor."
    def encodes(self, x:AudioTensor): return (x-x.mean()) / x.std()

To normalize our spectrogram batches, we can reuse fastai's existing `Normalize` with different arguments.

In [None]:
GlobalSpecNorm  = Normalize(global_mean,  global_std,  axes=(0,2,3))
ChannelSpecNorm = Normalize(channel_mean, channel_std, axes=(0,1,3))
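The practical difference between the two transforms shows up when the spectrum has a frequency-dependent level. The sketch below uses plain NumPy in place of fastai's `Normalize`, on synthetic "spectrograms" whose level tilts from -20 dB to -80 dB across 64 bins. The helper `normalize` and all numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tilt = np.linspace(-20.0, -80.0, 64).reshape(1, 1, 64, 1)   # per-bin level in dB
specs = tilt + 5.0 * rng.standard_normal((16, 1, 64, 40))   # (batch, ch, freq, time)

def normalize(x, axes):
    """Subtract the mean and divide by the std computed over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / std

global_norm = normalize(specs, axes=(0, 2, 3))     # single mean/std for everything
channel_norm = normalize(specs, axes=(0, 1, 3))    # one mean/std per frequency bin

per_bin_global = global_norm.mean(axis=(0, 1, 3))    # still follows the tilt
per_bin_channel = channel_norm.mean(axis=(0, 1, 3))  # approximately zero in every bin
```

Global normalization centers the data overall but keeps the tilt between bins, while channel normalization flattens it so each bin has zero mean and unit variance.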

# Training with different statistics

We follow the fastaudio baseline and train with each type of normalization for 10 epochs. 
We take the accuracy averaged over five runs.
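When averaging over runs, it can also help to report the spread, since a gap between two setups means little if it is smaller than the run-to-run variation. The sketch below uses Python's `statistics` module. The accuracy values are hypothetical placeholders, not results from this experiment.

```python
from statistics import mean, stdev

# hypothetical placeholder accuracies from five runs, NOT measured results
run_accuracies = [0.70, 0.72, 0.69, 0.71, 0.73]

avg = mean(run_accuracies)       # average accuracy over the runs
spread = stdev(run_accuracies)   # sample standard deviation across runs

summary = f'{avg:.3f} +/- {spread:.3f} over {len(run_accuracies)} runs'
```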

In [None]:
epochs = 10
num_runs = 5

## Baseline performance  

Before getting carried away with normalization, let's first find out where we stand.  
We set a baseline without normalization.  
This is the same loop as in the baseline results notebook. 

In [None]:
auds = DataBlock(blocks=(AudioBlock, CategoryBlock),  
                 get_x=ColReader("filename", pref=path/"audio"), 
                 splitter=CrossValidationSplitter(fold=1), 
                 batch_tfms = [audio2spec],  # the AudioToSpec transform created earlier
                 get_y=ColReader("category"))
dbunch = auds.dataloaders(df, bs=64)

accuracies = []
for i in range(num_runs):
    learn = cnn_learner(dbunch, 
                    resnet18, 
                    normalize=False,
                    config=cnn_config(n_in=1),
                    loss_fn=CrossEntropyLossFlat,
                    metrics=[accuracy]                 
                    )
    learn.fit_one_cycle(epochs)
    accuracies.append(learn.recorder.values[-1][-1])

print(f'Average accuracy without any normalization: {sum(accuracies) / num_runs}')



## Performance with global statistics

In [None]:
auds = DataBlock(blocks=(AudioBlock, CategoryBlock),  
                 get_x=ColReader("filename", pref=path/"audio"), 
                 splitter=CrossValidationSplitter(fold=1),
                 item_tfms = [AudioNormalize], 
                 batch_tfms = [audio2spec, GlobalSpecNorm],
                 get_y=ColReader("category"))
dbunch = auds.dataloaders(df, bs=64)


accuracies = []
for i in range(num_runs):
    # make cnn learner
    learn = cnn_learner(dbunch, 
                        resnet18, 
                        config=cnn_config(n_in=1),
                        loss_fn=CrossEntropyLossFlat,
                        metrics=[accuracy])
    # fit one cycle for given epochs
    learn.fit_one_cycle(epochs)
    accuracies.append(learn.recorder.values[-1][-1])

print(f'Average accuracy for "global" normalization: {sum(accuracies) / num_runs}')


## Performance with channel statistics

In [None]:
auds = DataBlock(blocks=(AudioBlock, CategoryBlock),  
                 get_x=ColReader("filename", pref=path/"audio"), 
                 splitter=CrossValidationSplitter(fold=1),
                 item_tfms = [AudioNormalize], 
                 batch_tfms = [audio2spec, ChannelSpecNorm],
                 get_y=ColReader("category"))
dbunch = auds.dataloaders(df, bs=64)


accuracies = []
for i in range(num_runs):
    # make cnn learner
    learn = cnn_learner(dbunch, 
                        resnet18, 
                        config=cnn_config(n_in=1),
                        loss_fn=CrossEntropyLossFlat,
                        metrics=[accuracy])
    # fit one cycle for given epochs
    learn.fit_one_cycle(epochs)
    accuracies.append(learn.recorder.values[-1][-1])

print(f'Average accuracy for "channel" normalization: {sum(accuracies) / num_runs}')
