# Transfer Learning from Lossy Codecs
[Slides](https://danjacobellis.github.io/SYSML/proposal.slides.html)

<script>
    document.querySelector('head').innerHTML += '<style>.slides { zoom: 1.75 !important; }</style>';
</script>


<center> <h2>
Dan Jacobellis
</h2> </center>

## Lossy compression

* Most data are stored using lossy formats (MP3, JPEG)
* 1-4 bit subband quantization is typical
* ~1.5 bits per sample/pixel after entropy coding

<p style="text-align:center;">
<img src="_images/lossy_lossless.png" width=700 height=700 class="center">
</p>

![](img/lossy_lossless.png)

## Conventional training procedure

* Training still suffers from all of the downsides of lossy compression
* It gets none of the benefits of the smaller representation

<p style="text-align:center;">
<img src="_images/conventional.png" width=700 height=700 class="center">
</p>

![](img/conventional.png)

## Neural compression standards

* SoundStream (Google, 2021) and EnCodec (Meta, 2022)
* Fully trained models are available to download and use

<p style="text-align:center;">
<img src="_images/encodec_architecture.png" width=700 height=700 class="center">
</p>

![](img/encodec_architecture.png)

## Neural codec transfer learning

* Example dataset: speech commands
  * Input $128 \times 128$ time-frequency distribution represented at full precision
  * Compressed size: $2 \times 75 \times 10$ binary codes
  * Size reduction of over $300\times$ with very little loss in speech intelligibility

| <audio controls="controls"><source src="./_static/left01.wav" type="audio/wav"></audio>      | <audio controls="controls"><source src="./_static/right01.wav" type="audio/wav"></audio>      | <audio controls="controls"><source src="./_static/yes01.wav" type="audio/wav"></audio>      | <audio controls="controls"><source src="./_static/no01.wav" type="audio/wav"></audio>      |
|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| <audio controls="controls"><source src="./_static/left01_ecdc.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/right01_ecdc.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/yes01_ecdc.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/no01_ecdc.wav" type="audio/wav"></audio> |
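
The size reduction quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "full precision" means 32-bit floats, which the slide does not state explicitly; `BITS_PER_FLOAT` is a name introduced here for illustration.

```python
# Assumption (not stated in the slide): "full precision" = 32-bit floats.
BITS_PER_FLOAT = 32

# Input: 128 x 128 time-frequency distribution at full precision.
input_bits = 128 * 128 * BITS_PER_FLOAT

# Compressed: 2 x 75 x 10 binary codes, i.e. one bit each.
code_bits = 2 * 75 * 10

ratio = input_bits / code_bits
print(f"input: {input_bits} bits, codes: {code_bits} bits, ratio: {ratio:.1f}x")
```

Under the 32-bit assumption the ratio comes out at roughly $350\times$, consistent with the "over $300\times$" figure. With 16-bit inputs it would be about half that.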

## Training on quantized data or discrete codes

* Ideally, we could just "replace" several low-precision inputs with a single high-precision input
* Naive approach: $y = x_1 + (x_2 \ll 1) + (x_3 \ll 2) + (x_4 \ll 3) + \cdots$
  * Amounts to creating a categorical variable
  * The standard approach to training on a categorical variable is one-hot encoding, which turns $n$ packed bits into $2^n$ inputs
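
The naive packing and the one-hot encoding it leads to can be sketched in NumPy as below. The helper names `pack_bits` and `one_hot` are hypothetical and chosen for illustration; this is not the project's implementation.

```python
import numpy as np

def pack_bits(bits):
    """Pack the last axis of a 0/1 array into integers: y = sum_k x_k << (k - 1)."""
    bits = np.asarray(bits, dtype=np.int64)
    # Weights 1, 2, 4, ... correspond to shifts by 0, 1, 2, ...
    weights = 1 << np.arange(bits.shape[-1], dtype=np.int64)
    return (bits * weights).sum(axis=-1)

def one_hot(labels, num_classes):
    """One-hot encode integer labels by indexing rows of an identity matrix."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

# Two examples, each with four 1-bit inputs (x_1, x_2, x_3, x_4).
bits = np.array([[1, 0, 1, 1],
                 [0, 0, 0, 1]])

packed = pack_bits(bits)                        # one integer per example
encoded = one_hot(packed, 2 ** bits.shape[-1])  # shape (2, 16)

print(packed)
print(encoded.shape)
```

Four binary inputs become one integer in $[0, 15]$, but one-hot encoding that integer needs 16 inputs, so the packed value is more compact in storage only, not as a network input.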

## Open questions and project goals

* What is the best way to train on quantized data?
  * Binary neural networks
  * Exploit sparsity (feature hashing)
  * Others?
* How do current neural codecs perform on out-of-distribution data?
  * Test performance of EnCodec (trained on speech and music) on other types of audio signals
* How effective is this type of transfer learning?
  * Reduction in data collection?
  * Reduction in computation?
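
As a rough illustration of the feature hashing option listed above, the sketch below maps a sequence of discrete codes into a fixed-size vector with the signed hashing trick. The function `hash_codes`, the bucket count, and the use of CRC32 as the hash are all assumptions made for this example; the project may well settle on a different scheme.

```python
import zlib
import numpy as np

def hash_codes(codes, num_buckets=64):
    """Signed hashing trick: each (position, code) pair adds +1 or -1 to one bucket."""
    features = np.zeros(num_buckets, dtype=np.float32)
    for position, code in enumerate(codes):
        # Hash the pair so the same code at different positions lands in different buckets.
        h = zlib.crc32(f"{position}:{int(code)}".encode())
        index = h % num_buckets                   # bucket chosen from the hash value
        sign = 1.0 if (h >> 31) & 1 else -1.0     # sign chosen from the top bit
        features[index] += sign
    return features

# Four hypothetical 10-bit codes (values in 0..1023).
codes = [13, 8, 1023, 0]
features = hash_codes(codes)
print(features.shape, int(np.count_nonzero(features)))
```

One-hot encoding these four codes would need $4 \times 1024$ inputs, while the hashed vector has a fixed length of 64 at the cost of occasional collisions.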