# Learning from Lossy Encoded Data
[Slides](https://danjacobellis.github.io/ITML/lossy_learning.slides.html)

<script>
    document.querySelector('head').innerHTML += '<style>.slides { zoom: 1.75 !important; }</style>';
</script>

<center> <h1>
Learning from Lossy Encoded Data
</h1> </center>

&nbsp;

<center> <h2>
Dan Jacobellis
</h2> </center>

## Lossy compression

* Most data are stored using lossy formats (MP3, JPEG)
* 1-4 bit subband quantization is typical
* ~1.5 bits per sample/pixel after entropy coding

<p style="text-align:center;">
<img src="_images/lossy_lossless.png" width=700 height=700 class="center">
</p>

![](img/lossy_lossless.png)
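
As a rough illustration of why entropy coding can push the rate below the nominal quantizer bit depth, the sketch below computes the empirical Shannon entropy of a skewed 2-bit symbol stream. The symbol counts are made up for illustration and the helper name `entropy_bits` is hypothetical; real codecs see different distributions in every subband.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Made-up output of a 2-bit quantizer, skewed toward zero (an assumption).
symbols = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2

print(entropy_bits(symbols))  # 1.75 bits per symbol, below the nominal 2 bits
```

An ideal entropy coder approaches this rate, which is how a 2-bit quantizer can end up costing fewer than 2 bits per sample on average.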

## Conventional Training Procedure

* The model still suffers from all of the downsides of lossy compression
* It gets none of the benefits of the smaller representation!

<p style="text-align:center;">
<img src="_images/conventional.png" width=700 height=700 class="center">
</p>

![](img/conventional.png)

## Training on transformed data

* Reduce size and number of initial convolutional layers by training on DCT coefficients
* Input size remains the same (quantization of the transform coefficients is ignored)

<p style="text-align:center;">
<img src="_images/dct_vs_resnet.png" width=800 height=800 class="center">
</p>

![](img/dct_vs_resnet.png)
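
For readers who want to see what "DCT coefficients" means concretely, here is a minimal sketch of the orthonormal 2-D DCT-II on a single 8x8 block, the block size JPEG uses. It is written in plain Python for clarity rather than speed, the function names `dct_1d` and `dct_2d` are hypothetical, and it is not the pipeline from the cited papers.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# A constant 8x8 block: all of its energy lands in the DC coefficient.
block = [[1.0] * 8 for _ in range(8)]
coeffs = dct_2d(block)
print(len(coeffs), len(coeffs[0]))  # 8 8 (same number of values as the input)
print(round(coeffs[0][0], 6))       # 8.0
```

The output has as many values as the input, which is the sense in which the input size stays the same until quantization is taken into account.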

## Training on transformed data

<p style="text-align:center;">
<img src="_images/FNNSFJ.png" width=700 height=700 class="center">
</p>

![](img/FNNSFJ.png)

## Low precision training

* bfloat16 (standardized in 2018)
* FP8 (E4M3 and E5M2, standardized in 2022)
* Using fewer than 8 bits requires radical changes to training procedures
* Still unable to exploit the low precision of lossy encodings (1-4 bits per sample)

<p style="text-align:center;">
<img src="_images/gradient_quant.png" width=800 height=800 class="center">
</p>

![](img/gradient_quant.png)
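
To make the formats listed above concrete, the sketch below tabulates their sign/exponent/mantissa bit layouts and shows bfloat16 as the upper 16 bits of a float32. Truncation is used here for simplicity (an assumption; hardware typically rounds to nearest), and the helper names are hypothetical.

```python
import struct

# (exponent bits, mantissa bits); every format also has one sign bit.
FORMATS = {
    "FP32": (8, 23),
    "bfloat16": (8, 7),
    "FP8 E4M3": (4, 3),
    "FP8 E5M2": (5, 2),
}


def total_bits(name):
    exponent, mantissa = FORMATS[name]
    return 1 + exponent + mantissa


def to_bfloat16_bits(x):
    """Keep the upper 16 bits of the float32 encoding (truncation, no rounding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def from_bfloat16_bits(b):
    """Expand 16 bfloat16 bits back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]


for name in FORMATS:
    print(name, total_bits(name))

print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))  # 3.140625
```

Even the 8-bit formats spend far more bits per value than the 1-4 bits per sample that lossy codecs typically keep.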

## Cost of conventional approach

* Example: ImageNet
    * Decode image and convert to floating point
    * Size of input layer: $224 \times 224 \times 3 \times 32$ bits per image
    * For batch size of 128:
        * 13 GB feature memory
        * 497 GFLOPs per pass
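
The input-layer size above can be checked with a few lines of arithmetic. The sketch below only covers the input tensor; the helper name `input_bits` is hypothetical.

```python
def input_bits(height, width, channels, bits_per_value):
    """Number of bits in one input tensor."""
    return height * width * channels * bits_per_value


bits = input_bits(224, 224, 3, 32)
print(bits)               # 4816896 bits per image
print(bits // 8 // 1024)  # 588 KiB per image

batch_size = 128
print(bits * batch_size / 8 / 2**20)  # 73.5 MiB of input for one batch
```

The input tensor is a small fraction of the 13 GB quoted above, which appears to cover the features of every layer rather than the input alone.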

## Training on quantized data

* Each transform coefficient only contains about 2 bits instead of 8
* "Replace" each 2x2 block of 2-bit transform coefficients with a single high precision input
  * Size of input layer: $112 \times 112 \times 3 \times 32$ bits per image
  * 3 GB feature memory (down from 13)
  * 131 GFLOPs per pass (down from 497)
* How do we "replace" several low-precision inputs with a single high precision input?
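
Before deciding how to merge them, the coefficients first have to be grouped. A minimal sketch of that grouping step, assuming non-overlapping 2x2 blocks read in row-major order (the function name `group_2x2` and the toy values are hypothetical):

```python
def group_2x2(grid):
    """Split a 2-D grid (list of rows) into non-overlapping 2x2 blocks.

    Returns a grid with half the height and width whose entries are
    4-tuples (top-left, top-right, bottom-left, bottom-right).
    """
    h, w = len(grid), len(grid[0])
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    return [
        [
            (grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


# Toy 4x4 grid of 2-bit coefficients (values 0..3).
coeffs = [
    [0, 1, 2, 3],
    [3, 2, 1, 0],
    [1, 1, 0, 0],
    [2, 2, 3, 3],
]
blocks = group_2x2(coeffs)
print(len(blocks), len(blocks[0]))  # 2 2
print(blocks[0][0])                 # (0, 1, 3, 2)
```

Applied to a $224 \times 224$ grid, this yields the $112 \times 112$ layout used for the input-layer size above; each 4-tuple still has to be turned into one input value.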

## Training on quantized data
* How do we "replace" several low-precision inputs with a single high precision input?
* Naive approach: $y = x_1 + (x_2 \ll 2) + (x_3 \ll 4) + (x_4 \ll 6)$
  * Won't work for most codecs since each subband has different quantization
  * Amounts to creating a categorical variable
  * The standard approach to training on a categorical variable is to one-hot encode it
    * We're back to where we started!
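
The naive packing and its one-hot consequence can be sketched as follows. The helper names are hypothetical, and the sketch assumes every input is an unsigned 2-bit value, which is exactly the assumption that fails when subbands use different quantizers.

```python
def pack_2bit(x1, x2, x3, x4):
    """Pack four 2-bit values (0..3) into one 8-bit integer."""
    for v in (x1, x2, x3, x4):
        assert 0 <= v <= 3, "each input must fit in 2 bits"
    return x1 + (x2 << 2) + (x3 << 4) + (x4 << 6)


def unpack_2bit(y):
    """Recover the four 2-bit values from a packed byte."""
    return tuple((y >> shift) & 0b11 for shift in (0, 2, 4, 6))


def one_hot(y, num_categories=256):
    """One-hot encode the packed value as a categorical variable."""
    vec = [0] * num_categories
    vec[y] = 1
    return vec


y = pack_2bit(1, 2, 3, 0)
print(y)                # 57
print(unpack_2bit(y))   # (1, 2, 3, 0)
print(len(one_hot(y)))  # 256
```

The packed value carries 8 bits of information, but its one-hot encoding needs 256 inputs, so the input layer grows instead of shrinking.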

## Binarized neural networks
* Proposed in 2016 as a way to reduce memory consumption
* Replace some floating point arithmetic with bit-wise operations
  * Binary Weight Network (BWN): only the kernels are binarized
  * Binary Activation Network (BAN): only the inputs are binarized
  * Binary Neural Network (BNN): both the inputs and the kernels are binarized
* The first layer of a BNN is typically kept at full precision
  * This gives us an opportunity to exploit the low precision of lossy encoded data
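
One common way to replace floating-point arithmetic with bit-wise operations is the XNOR-popcount dot product. The sketch below is a minimal illustration, assuming sign binarization to $\pm 1$ and a bitmask encoding ($+1 \to 1$, $-1 \to 0$). The function names are hypothetical and this is not the code used for the experiments.

```python
def binarize(values):
    """Map real values to +1 / -1 by sign (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def to_bits(signs):
    """Encode a +1/-1 vector as an integer bitmask (+1 -> 1, -1 -> 0)."""
    mask = 0
    for i, s in enumerate(signs):
        if s == 1:
            mask |= 1 << i
    return mask


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors stored as bitmasks."""
    agree = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # XNOR, limited to n bits
    matches = bin(agree).count("1")              # popcount
    return 2 * matches - n                       # matches minus mismatches


x = binarize([0.5, -1.2, 0.3, -0.1])   # inputs
w = binarize([-0.7, -0.4, 0.9, 0.2])   # kernel weights
reference = sum(a * b for a, b in zip(x, w))
fast = xnor_popcount_dot(to_bits(x), to_bits(w), len(x))
print(reference, fast)  # 0 0
```

Both results agree because each matching bit contributes $+1$ to the dot product and each mismatch contributes $-1$.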

## Lossy compression standards

| Standard       | Introduced | Signal | Transform         | Quantization                   |
|----------------|------------|--------|-------------------|--------------------------------|
| MPEG Layer III | 1991       | Audio  | block DCT and FFT | Perceptual quantization vector |
| JPEG           | 1992       | Image  | block DCT         | Perceptual quantization matrix |
| JPEG 2000      | 2000       | Image  | Separable Wavelet | Uniform scalar quantization    |
| CELT/Opus      | 2011       | Audio  | block DCT         | Pyramid vector quantization    |
| HEVC           | 2013       | Image  | block DCT and DST | Perceptual quantization matrix |
| Soundstream    | 2021       | Audio  | Learned           | Residual vector quantization   |
| Encodec        | 2022       | Audio  | Learned           | Residual vector quantization   |

## Audio classification: baseline
* Dataset: Speech Commands
  * One-second speech segments of 8 possible words
  * 'stop,' 'down,' 'no,' 'right,' 'go,' 'up,' 'yes,' 'left'
* Baseline model:
  * Input size: $128 \times 128$ time-frequency distribution represented at full precision
  * 119.52 MiB feature size
  * 2.26 GFLOPs per pass
  * Achieves test accuracy of about 84%

## Audio classification: VQ + BNN
* Encode 2x2 time-frequency blocks via vector quantization
  * Use mini-batch k-means to learn codebook of 16 vectors (4 bits)
  * Compression ratio of 16:1 (before any entropy coding)
* Input size: $64 \times 64 \times 4$ binary codes

| <audio controls="controls"><source src="./_static/left01.wav" type="audio/wav"></audio>    | <audio controls="controls"><source src="./_static/right01.wav" type="audio/wav"></audio>    | <audio controls="controls"><source src="./_static/yes01.wav" type="audio/wav"></audio>    | <audio controls="controls"><source src="./_static/no01.wav" type="audio/wav"></audio>    |
|--------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| <audio controls="controls"><source src="./_static/left01_vq.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/right01_vq.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/yes01_vq.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/no01_vq.wav" type="audio/wav"></audio> |

## Audio classification: VQ + BNN
* Input size: $64 \times 64 \times 4$ binary codes
* 3.74 MiB feature size
* Multiply-accumulate (MAC) operations instead of floating-point (FP) arithmetic
  * 4-way MAC unit uses about 55% of the area of a FP16 FPU
  * 4-8$\times$ more power efficient than bfloat16
  * $>20\times$ more power efficient than FP32
* Achieves test accuracy of about 79% (down from baseline of 84%)
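
The encoding step for this model (2x2 blocks, a 16-entry codebook, 4 binary codes per block) might look like the sketch below. Two assumptions are made: the codebook here is a hand-made placeholder rather than one learned with mini-batch k-means, and the four binary inputs per block are taken to be the bits of the codebook index. All function names are hypothetical.

```python
def nearest_code(block, codebook):
    """Index of the codebook vector closest to `block` (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist2(block, codebook[k]))


def index_to_bits(index, num_bits=4):
    """Binary expansion of a codebook index, most significant bit first."""
    return [(index >> (num_bits - 1 - i)) & 1 for i in range(num_bits)]


def encode(tf, codebook):
    """Map an HxW time-frequency grid to an (H/2)x(W/2)x4 grid of bits."""
    out = []
    for r in range(0, len(tf), 2):
        row = []
        for c in range(0, len(tf[0]), 2):
            block = (tf[r][c], tf[r][c + 1], tf[r + 1][c], tf[r + 1][c + 1])
            row.append(index_to_bits(nearest_code(block, codebook)))
        out.append(row)
    return out


# Placeholder codebook of 16 constant-valued blocks (an assumption; the
# real codebook would be learned from training data).
codebook = [(k / 15.0,) * 4 for k in range(16)]

# Synthetic 128x128 "time-frequency distribution" with values in [0, 1).
tf = [[(r * 128 + c) / (128 * 128) for c in range(128)] for r in range(128)]

codes = encode(tf, codebook)
print(len(codes), len(codes[0]), len(codes[0][0]))  # 64 64 4
```

With a learned codebook only the contents of `codebook` change; the output shape stays $64 \times 64 \times 4$.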

## Neural compression standards

* Soundstream (Google, 2021) and Encodec (Meta, 2022)

<p style="text-align:center;">
<img src="_images/encodec_architecture.png" width=700 height=700 class="center">
</p>

![](img/encodec_architecture.png)

## Audio classification: Neural Compression + BNN

| <audio controls="controls"><source src="./_static/left01.wav" type="audio/wav"></audio>      | <audio controls="controls"><source src="./_static/right01.wav" type="audio/wav"></audio>      | <audio controls="controls"><source src="./_static/yes01.wav" type="audio/wav"></audio>      | <audio controls="controls"><source src="./_static/no01.wav" type="audio/wav"></audio>      |
|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| <audio controls="controls"><source src="./_static/left01_ecdc.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/right01_ecdc.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/yes01_ecdc.wav" type="audio/wav"></audio> | <audio controls="controls"><source src="./_static/no01_ecdc.wav" type="audio/wav"></audio> |

* Input size: $8 \times 75 \times 10$ binary codes
* 1.67 MiB feature size
* 111M MACs per pass
* Test accuracy of 58%
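
The $8 \times 75 \times 10$ input shape can be read as 8 code streams of 75 frames, each code expanded into 10 bits. The sketch below shows that expansion, assuming each code is an index into a 1024-entry codebook (hence 10 bits). The code values are placeholders rather than real codec output, and `codes_to_bits` is a hypothetical helper.

```python
def codes_to_bits(codes, bits_per_code=10):
    """Expand integer codes[q][t] into bits[q][t][b], most significant bit first."""
    return [
        [
            [(code >> (bits_per_code - 1 - b)) & 1 for b in range(bits_per_code)]
            for code in stream
        ]
        for stream in codes
    ]


num_codebooks, frames = 8, 75

# Placeholder codes in 0..1023 (an assumption; a real encoder would supply these).
codes = [
    [(37 * q + 11 * t) % 1024 for t in range(frames)]
    for q in range(num_codebooks)
]

bits = codes_to_bits(codes)
total = sum(len(frame) for stream in bits for frame in stream)
print(len(bits), len(bits[0]), len(bits[0][0]))  # 8 75 10
print(total)                                     # 6000 binary inputs per clip
```

At 6000 bits for a one-second clip, this input is smaller than the $64 \times 64 \times 4$ vector-quantized input of the previous model.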

## References

[Faster Neural Networks Straight from JPEG](https://papers.nips.cc/paper/2018/hash/7af6266cc52234b5aa339b16695f7fc4-Abstract.html)

[Deep Residual Learning in the JPEG Transform Domain](https://openaccess.thecvf.com/content_ICCV_2019/html/Ehrlich_Deep_Residual_Learning_in_the_JPEG_Transform_Domain_ICCV_2019_paper.html)

[Ultra-Low Precision 4-bit Training of Deep Neural Networks](https://proceedings.neurips.cc/paper/2020/file/13b919438259814cd5be8cb45877d577-Paper.pdf)

[Binarized Neural Networks](https://proceedings.neurips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html)

[Estimates of memory consumption and FLOP counts for various convolutional neural networks](https://github.com/albanie/convnet-burden/blob/master/reports/resnet-50.md)