In our project, we implement a class of perfect-reconstruction filterbanks for audio analysis. We use a GPU to solve embarrassingly parallel subtasks so that the algorithm takes only $O\left(\log{N}\right)$ time steps. We also introduce a novel type of time-frequency tiling that retains the benefits of constant-Q transforms without requiring their cumbersome data structure.

For audio signal processing problems such as compression, denoising, and various types of pattern recognition, deep learning has taken over. For example, search interest in "wavelet" has followed the opposite trajectory to that of "convolutional neural network".

![](img/trends1.png)

A current trend in audio signal processing research is to apply 2D convolutional neural networks to time-frequency representations of audio. Although great progress was made in the pre-deep-learning era on perfect-reconstruction filterbanks and wavelet transforms, most of these algorithms and techniques have been abandoned by practitioners of deep learning because no standardized, free implementations were ever widely disseminated. As a result, the standardized set of triangular "Mel" filterbanks, dating back nearly 50 years, appears to have remained the dominant tool, and has even seen a resurgence despite its impending obsolescence.

![](img/trends2.png)

The use of these methods is accumulating a considerable debt. Because they are based on windowed FFTs, the time resolution is the same for all frequencies. This makes the representation overly redundant at low frequencies and undersampled at high frequencies. As a result, they lack the perfect reconstruction property and have an inconvenient phase representation. They are also difficult to generalize to multiple channels, phased arrays, or higher-dimensional transforms.

Filter banks are the oldest method of time-frequency analysis, dating back to the first analog spectrum analyzer built by Hermann von Helmholtz in the 19th century.

![](img/helmholtz.png)

In 1909, mathematician Alfréd Haar discovered that a list of $2^n$ numbers can be represented by recursively taking the sum and difference of adjacent pairs. This property is now commonly referred to as 'alias cancellation'.

In the 1970s, engineers created the first discrete, invertible filterbanks by exploiting this property, using what are known as 'conjugate mirror filters'. In the following decade, a massive research effort took place, resulting in a rigorous mathematical understanding of these processes and the development of various "wavelet transforms."

The Haar decomposition is a simple type of perfect reconstruction filter bank and is the basis of our algorithm.

For simplicity, we've taken the input samples to be integers, but the method works with any type of signal.

We apply two filters using convolution. One filter is low-pass and the other is high-pass.

You can see that the sum and difference not only decompose the signal into high and low frequency components, but they do so without increasing the number of words required to represent the signal.
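As a minimal sketch of the sum-and-difference step described above, the following pure-Python function applies one level of an unnormalized Haar analysis to a list of integers. The function name `haar_analysis`, the example samples, and the sign convention of the difference (first sample of each pair minus the second) are assumptions made for illustration; this is not our CUDA implementation.

```python
def haar_analysis(x):
    """One level of unnormalized Haar analysis (illustrative sketch).

    Returns (low, high): the sums and differences of adjacent,
    non-overlapping pairs. Assumes len(x) is even.
    """
    if len(x) % 2:
        raise ValueError("input length must be even")
    low = [x[i] + x[i + 1] for i in range(0, len(x), 2)]   # low-pass, downsampled by 2
    high = [x[i] - x[i + 1] for i in range(0, len(x), 2)]  # high-pass, downsampled by 2
    return low, high


signal = [4, 6, 10, 12, 8, 6, 5, 5]  # hypothetical integer samples
low, high = haar_analysis(signal)
print(low)   # [10, 22, 14, 10]
print(high)  # [-2, -2, 2, 0]
```

The two output bands together hold eight numbers, the same count as the input, which is the "no extra words" property noted above.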

This is an example of a “critically sampled” dyadic filterbank. Critically sampled means that the size of the data after applying the filterbank is exactly the same as that of the original data, and no information is lost. "Dyadic" just refers to the fact that there are two filters in the filterbank.

We call the first part an analysis filterbank, since it is decomposing a 1-d signal into multiple frequency bands.

We call the reverse operation a synthesis filterbank since it reconstructs the 1-d signal from the frequency bands.
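The analysis and synthesis pair can be sketched as follows, again in plain Python and assuming the unnormalized sum and difference convention. The names `haar_analysis` and `haar_synthesis` and the sample values are hypothetical. The point of the sketch is only that synthesis recovers the input exactly.

```python
def haar_analysis(x):
    """Split x into pairwise sums (low band) and differences (high band)."""
    low = [x[i] + x[i + 1] for i in range(0, len(x), 2)]
    high = [x[i] - x[i + 1] for i in range(0, len(x), 2)]
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis for integer input.

    If s = a + b and d = a - b, then a = (s + d) / 2 and b = (s - d) / 2.
    For integers, s + d and s - d are always even, so // is exact.
    """
    x = []
    for s, d in zip(low, high):
        x.append((s + d) // 2)
        x.append((s - d) // 2)
    return x


signal = [4, 6, 10, 12, 8, 6, 5, 5]  # hypothetical integer samples
low, high = haar_analysis(signal)
reconstructed = haar_synthesis(low, high)
print(reconstructed == signal)  # True
```

For floating-point signals the same formulas apply with ordinary division in place of `//`.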

In this example, we end up with two signals that each have half the time resolution. However, we went from having only one frequency band to two, so we have doubled our frequency resolution.

In our algorithm, we recursively apply the analysis filters to decompose the low-frequency band until we have 10 octaves of resolution (plus a DC band).
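The recursion on the low band can be sketched as below. To keep the example small it uses 3 octaves on 8 samples rather than the 10 octaves mentioned above, and it assumes the unnormalized Haar filters. The names `haar_analysis` and `octave_decompose` are hypothetical.

```python
def haar_analysis(x):
    """Split x into pairwise sums (low band) and differences (high band)."""
    low = [x[i] + x[i + 1] for i in range(0, len(x), 2)]
    high = [x[i] - x[i + 1] for i in range(0, len(x), 2)]
    return low, high


def octave_decompose(x, octaves):
    """Repeatedly split the low band; keep each high band as one octave.

    Returns [octave_1, ..., octave_k, dc], where octave_1 is the highest
    frequency band. Assumes len(x) is divisible by 2**octaves.
    """
    bands = []
    low = list(x)
    for _ in range(octaves):
        low, high = haar_analysis(low)
        bands.append(high)
    bands.append(low)  # remaining low band (the DC band)
    return bands


signal = [4, 6, 10, 12, 8, 6, 5, 5]  # hypothetical integer samples
bands = octave_decompose(signal, 3)
print(bands)  # [[-2, -2, 2, 0], [-12, 4], [8], [56]]
```

Each octave has half as many samples as the one before it, and the last entry is the sum of all input samples when the unnormalized filters are used.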

Then for each octave, we recursively apply the analysis filters but now we decompose both the low-frequency and high-frequency components.
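A sketch of the second stage, in which both the low and the high output of every split are decomposed again, is given below. The depth to which each octave is split is not specified here, so it is left as a parameter. The function names, the example band, and the choice to return sub-bands in natural tree order (not sorted by frequency) are assumptions for illustration.

```python
def haar_analysis(x):
    """Split x into pairwise sums (low band) and differences (high band)."""
    low = [x[i] + x[i + 1] for i in range(0, len(x), 2)]
    high = [x[i] - x[i + 1] for i in range(0, len(x), 2)]
    return low, high


def packet_decompose(band, depth):
    """Split a band into 2**depth sub-bands by recursing on BOTH outputs.

    Sub-bands are returned in tree order (low branch first), which is
    not the same as increasing-frequency order.
    Assumes len(band) is divisible by 2**depth.
    """
    if depth == 0:
        return [list(band)]
    low, high = haar_analysis(band)
    return packet_decompose(low, depth - 1) + packet_decompose(high, depth - 1)


octave = [10, 22, 14, 10]  # hypothetical samples from one octave band
sub_bands = packet_decompose(octave, 2)
print(sub_bands)  # [[56], [8], [-8], [-16]]
```

Splitting a band of length 4 to depth 2 yields four sub-bands of one sample each, so the total number of coefficients is unchanged.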

We exploit highly optimized filtering operations in CUDA so that each convolution takes $O(\log M)$ time steps, where $M$ is the filter length.

We also implemented the corresponding synthesis filterbank.

If we recursively apply this procedure, we refine the time-frequency tiling of our representation.

Depending on how we structure the tree, we can either get a uniform tiling (similar to STFT), or a constant-Q tiling.

The structure on the left has a simple convenient data structure: we can represent each tile as the element of a matrix.

The structure on the right is often a better representation of the underlying signal structure, and provides good time resolution at the places we need and good frequency resolution in the places we need.
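To make the data-structure difference concrete, the sketch below computes how many time samples each band holds under the two tree shapes, assuming critically sampled two-channel splits. The function names and the sizes (1024 samples, 10 octaves, depth 5) are illustrative assumptions.

```python
def octave_band_lengths(n, octaves):
    """Samples per band when only the low band is split at each level.

    Returns lengths from the highest octave down, followed by the DC band.
    """
    lengths = [n >> k for k in range(1, octaves + 1)]
    lengths.append(n >> octaves)  # remaining low (DC) band
    return lengths


def uniform_band_lengths(n, depth):
    """Samples per band when every band is split at each level."""
    return [n >> depth] * (1 << depth)


constant_q = octave_band_lengths(1024, 10)
uniform = uniform_band_lengths(1024, 5)
print(constant_q)                # [512, 256, 128, 64, 32, 16, 8, 4, 2, 1, 1]
print(len(uniform), uniform[0])  # 32 32
```

The uniform tree yields bands of equal length that stack into a 32 by 32 matrix, while the octave tree yields bands of different lengths that need a ragged container. Both hold exactly 1024 coefficients.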

What if we could have the best of both worlds?

That's exactly what we did!

$$
y = x \ast h =
\begin{bmatrix}
    x_1 & 0 & \cdots & 0 & 0 \\
    x_2 & x_1 &      & \vdots & \vdots \\
    x_3 & x_2 & \cdots & 0 & 0 \\
    \vdots & x_3 & \cdots & x_1 & 0 \\
    x_{n-1} & \vdots & \ddots & x_2 & x_1 \\
    x_n & x_{n-1} &      & \vdots & x_2 \\
    0 & x_n & \ddots & x_{n-2} & \vdots \\
    0 & 0 & \cdots & x_{n-1} & x_{n-2} \\
    \vdots & \vdots &        & x_n & x_{n-1} \\
    0 & 0 & 0 & \cdots & x_n
\end{bmatrix}
\begin{bmatrix}
    0 \\
    0 \\
    \vdots\\
    h_1 \\
    h_2 \\
    \vdots \\
    h_m \\
    0 \\
    \vdots \\
    0 \\
\end{bmatrix}
$$
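The matrix form of convolution can be checked numerically with a small sketch. The version below builds the banded (Toeplitz) matrix from $x$ with one column per filter tap, without the zero padding of $h$ shown above, and compares the matrix-vector product against a direct convolution. All function names and the example values are assumptions for illustration.

```python
def convolve_direct(x, h):
    """Full linear convolution computed from the definition."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def convolution_matrix(x, m):
    """(n + m - 1) x m matrix whose column c is x shifted down by c rows."""
    n = len(x)
    return [[x[r - c] if 0 <= r - c < n else 0 for c in range(m)]
            for r in range(n + m - 1)]


def matvec(a, v):
    """Matrix-vector product; each output row is an independent dot product."""
    return [sum(a_rc * v_c for a_rc, v_c in zip(row, v)) for row in a]


x = [1, 2, 3, 4]  # hypothetical signal
h = [1, -1]       # hypothetical two-tap difference filter
y_matrix = matvec(convolution_matrix(x, len(h)), h)
y_direct = convolve_direct(x, h)
print(y_matrix)  # [1, 1, 1, 1, -4]
```

Each row of the product is a separate dot product, which is what makes the computation straightforward to distribute across GPU threads.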


Since the core computation of our algorithm is applying filters to a digital signal (i.e., a sequence of discrete numbers), the major computational burden is convolution.
The analysis and synthesis filterbanks each consist of recursive filtering, followed by downsampling in the analysis case and preceded by upsampling in the synthesis case. Convolving a signal with a digital filter consists of a number of dot products, all of which are independent. Specifically, for a sequence of length $N$ and a filter of length $M$, there are $N$ parallelizable dot products, and each dot product takes $O(\log M)$ time steps to sum its $M$ products using a parallel reduction.
cuDNN has a highly optimized convolution implementation with support for various parameters such as padding, stride, and dilation. By combining these parameters, we can perform each filtering step in our filterbank efficiently.
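The $O(\log M)$ figure for a single dot product can be illustrated with a pairwise (tree) reduction. The sketch below runs sequentially in Python, but each pass of the `while` loop corresponds to one step in which all pairwise additions could be performed at the same time. The function names and the example window are assumptions for illustration, not cuDNN's internals.

```python
def tree_sum(values):
    """Sum values by pairwise reduction; return (total, number_of_steps).

    Each step halves the number of partial sums, so M values need
    ceil(log2(M)) steps. Within a step the additions are independent.
    """
    vals = list(values)
    if not vals:
        return 0, 0
    steps = 0
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)  # pad so every element has a partner
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps


def dot_by_reduction(window, taps):
    """Dot product: M independent multiplies, then a tree reduction."""
    products = [w * t for w, t in zip(window, taps)]
    return tree_sum(products)


window = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical signal window
taps = [1] * 8                     # hypothetical 8-tap filter
total, steps = dot_by_reduction(window, taps)
print(total, steps)  # 36 3
```

With 8 taps the reduction finishes in 3 steps, and every output sample of the convolution can run its own reduction independently of the others.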

We tested how our algorithm scales with different data sizes, from about one second of music to one hour of film.
The result matches the derived complexity of our algorithm: scaling the input by a factor of $10^4$ increases the running time by a factor of about $10^2$. We can also see that the analysis and synthesis filterbanks scale in exactly the same way. Synthesis, the inverse operation, takes slightly longer because it has additional terms.

Next, we compared our algorithm with Librosa, a widely used audio processing library. Its algorithm is not optimized for a similar task and, as Dan explained earlier, it uses non-reconstructible filters, so it performs some unnecessary operations (such as Griffin-Lim). We can clearly see a speedup of up to about 200 times. This is a big deal: when a filterbank is used as a preprocessing step for large-scale deep learning datasets, some of which contain hundreds of thousands of hours of audio, the waiting time for preprocessing is dramatically reduced.


