A major concern with the current trend in deep learning (DL) systems is their high cost in terms of (1) the computational resources needed to train models, (2) the amount of data required, and (3) design complexity. These costs are further compounded for multimodal applications, which stand to be some of the most impactful. These barriers increase the risk that the incredible power and utility of these systems will be realized only by a handful of large companies and organizations, leaving them less accessible and auditable to the wider economy and society.

Fortunately, conventional deep learning procedures for audio, images, and video exhibit inefficiencies that can be exploited to address these barriers. For example, consider the benchmark task of object recognition using the ImageNet dataset. Although the data are stored in a lossy compression format (JPEG) that exploits the underlying sparsity of natural images, the conventional DL input pipeline requires full decoding, resampling to a uniform size, and high-precision floating-point arithmetic. The result is a massive expansion in the number of bits used to represent each image by the time it reaches the first layer of a neural network, as depicted in Fig.~\ref{fig:CNN}. The complexity and inefficiency of this pipeline prevent the embarrassingly parallel nature of DL training from being fully exploited by power-efficient accelerators and coprocessors, which offer incredible computational performance per watt and per dollar.
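
The size of this expansion can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the compressed rate of 2 bits per pixel is an assumed placeholder, not a measurement of ImageNet. It also ignores the change in pixel count caused by resampling.

```python
def dense_bits_per_pixel(channels: int = 3, bits_per_value: int = 32) -> int:
    """Bits used per pixel once an image is decoded into a dense tensor."""
    return channels * bits_per_value


def expansion_factor(compressed_bpp: float,
                     channels: int = 3,
                     bits_per_value: int = 32) -> float:
    """Ratio of dense-tensor bits to compressed bits for the same pixels."""
    return dense_bits_per_pixel(channels, bits_per_value) / compressed_bpp


# ASSUMPTION: a hypothetical JPEG rate of 2 bits per pixel, chosen only
# to make the arithmetic concrete.
ASSUMED_JPEG_BPP = 2.0

print(dense_bits_per_pixel())                                  # RGB float32
print(expansion_factor(ASSUMED_JPEG_BPP))                      # float32 tensor
print(expansion_factor(ASSUMED_JPEG_BPP, bits_per_value=8))    # 8-bit decode only
```

Under these assumptions, an RGB image held as 32-bit floats occupies 96 bits per pixel, so the factor scales inversely with whatever compressed rate the stored files actually achieve.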

Even more inefficiencies can be observed in the relatively neglected study of learning from acoustic signals, particularly in the case of spatial, multi-channel, infrasound, and ultrasound recordings. DL approaches for acoustic signals borrow successful techniques from speech processing and image recognition, which are suboptimal in these contexts. For example, time-frequency representations (TFRs) are ubiquitous for their ability to reveal the underlying signal sparsity. However, commonly used representations borrowed from speech processing, like the Mel spectrogram, either discard high-frequency information or exhibit redundancy at low frequencies, and they entirely discard phase to create an image-like matrix. Though this allows convenient adaptation of image processing architectures, extension beyond stereo, indoor, and audio-frequency recordings is difficult.
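
To make the cost of discarding phase concrete, the following minimal sketch uses only the Python standard library. It shows that a signal and a delayed copy of it have identical magnitude spectra, so a magnitude-only representation cannot tell them apart. The example is idealized: it uses a single frame, a circular delay, and a naive DFT, and the helper names and sample values are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def magnitudes(x):
    """Magnitude spectrum: the phase of every bin is thrown away."""
    return [abs(c) for c in dft(x)]


# A short transient and the same transient circularly delayed by 3 samples,
# standing in for the same source arriving later at a second sensor.
signal = [0.0, 1.0, 0.5, -0.25, 0.0, 0.0, 0.0, 0.0]
delayed = signal[-3:] + signal[:-3]

same_magnitude = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(magnitudes(signal), magnitudes(delayed))
)

print("waveforms differ:", signal != delayed)
print("magnitude spectra match:", same_magnitude)
```

Because the delay appears only in the phase of each frequency bin, any inter-channel timing information of this kind is lost once phase is dropped.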

Two emerging DL technologies, learned data compression (LDC) \cite{introNDC} and self-supervised learning (SSL), present an opportunity to massively reduce the complexity of DL pipelines and multimodal learning. Algorithmic information theory and the efficient coding hypothesis both suggest strong connections between these techniques that have yet to be understood or exploited.

We propose the study of these techniques under a single umbrella, with a focus on generalizability to all modalities of natural signals, from low-frequency sonar arrays to electro-optical/infrared satellite images. Specifically, we propose building interoperable training pipelines for (1) lossy LDC, (2) pretraining, (3) fine-tuning, and (4) domain adaptation. Initial experiments indicate opportunities to increase training throughput and simplify procedures for performing multimodal and few-shot learning.

In [5]:
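# Export the notebook to markdown, dropping tagged cells and inputs.
!jupyter nbconvert \
--to markdown proposal.ipynb \
--TagRemovePreprocessor.enabled=True \
--TagRemovePreprocessor.remove_cell_tags='remove_cell' \
--TagRemovePreprocessor.remove_input_tags='remove_input'
# The markdown cells contain raw LaTeX, so the export is renamed to .tex.
!mv proposal.md proposal.tex

# Standard LaTeX build: pdflatex, bibtex, then pdflatex twice to resolve references.
!pdflatex --shell-escape proposal
!bibtex proposal
!pdflatex --shell-escape proposal
!pdflatex --shell-escape proposal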

[NbConvertApp] Converting notebook proposal.ipynb to markdown
[NbConvertApp] Writing 4693 bytes to proposal.md
This is pdfTeX, Version 3.141592653-2.6-1.40.22 (TeX Live 2022/dev/Debian) (preloaded format=pdflatex)
 \write18 enabled.
entering extended mode
(./proposal.tex
LaTeX2e <2021-11-15> patch level 1
L3 programming layer <2022-01-21>
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2021/10/04 v1.4n Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size10.clo))
(/usr/share/texlive/texmf-dist/tex/latex/svg/svg.sty
(/usr/share/texlive/texmf-dist/tex/generic/iftex/iftex.sty)
(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrbase.sty
(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrlfile.sty
(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrlfile-hook.sty
(/usr/share/texlive/texmf-dist/tex/latex/koma-script/scrlogo.sty)))
(/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty))
(/usr/share/texlive/texmf