# Compressed Learning

[original proposal](https://danjacobellis.net/ITML/_static/proposal.pdf)

## Lossy compression standards

While hundreds of lossy compression techniques have been standardized, only a few have been widely deployed. We will put aside inter-frame video compression for now and focus on audio, image, and intra-frame video compression since our goal is to exploit the transform and quantization steps.

| Standard       | Introduced | Signal | Transform         | Quantization                   |
|----------------|------------|--------|-------------------|--------------------------------|
| MPEG Layer III | 1991       | Audio  | block DCT and FFT | Perceptual quantization vector |
| JPEG           | 1992       | Image  | block DCT         | Perceptual quantization matrix |
| JPEG 2000      | 2000       | Image  | Separable Wavelet | Uniform scalar quantization    |
| CELT/Opus      | 2011       | Audio  | block DCT         | Pyramid vector quantization    |
| HEVC           | 2013       | Image  | block DCT and DST | Perceptual quantization matrix |
| Soundstream    | 2021       | Audio  | Learned           | Residual vector quantization   |

### Transforms
The block DCT and its variants dominate among commonly used codecs. However, emerging standards use learned encoders. Soundstream (Google) and Encodec (Meta) have recently become standardized. Both are based on the vector quantized variational autoencoder (VQ-VAE). Given the computational cost of the encoder and decoder, it may seem implausible that these learned codecs will be widely adopted. But considering that a model as large as Stable Diffusion can run on an iPhone, the computational cost may eventually be reasonable.

### Quantization
The dominant approach to quantization is to define a perceptual quantization matrix (or vector) $Q$ and perform the simple element-wise quantization $B = \operatorname{round}\left(\frac{G}{Q}\right)$, where $G$ is the transformed signal and $B$ is the quantized representation. In CELT (now integrated into the Opus standard), the transform coefficients are normalized to unit norm. The resulting unit vector is coded separately from the magnitude, which allows the use of pyramid vector quantization, a much more efficient scheme that is only applicable to unit vectors. In the learned codecs (Soundstream and Encodec), several low-complexity vector quantizers are used in sequence, each quantizing the residual left by the previous one.
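The two quantizers described above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the 2x2 matrix `Q`, the coefficient values, the toy codebooks, and the helper names `quantize`, `dequantize`, and `residual_vq` are all made up for the example and do not come from any codec specification.

```python
import numpy as np

def quantize(G, Q):
    """Element-wise B = round(G / Q), as in the formula above."""
    return np.round(G / Q).astype(np.int32)

def dequantize(B, Q):
    """Approximate reconstruction of G from the integer codes B."""
    return B * Q

def residual_vq(x, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual = np.asarray(x, dtype=np.float64)
    reconstruction = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        # Pick the codebook entry closest to the current residual.
        distances = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(distances))
        indices.append(k)
        reconstruction = reconstruction + codebook[k]
        residual = residual - codebook[k]
    return indices, reconstruction

# Hypothetical 2x2 example: larger steps in Q discard more detail.
G = np.array([[52.0, -7.0], [3.0, 1.0]])
Q = np.array([[4.0, 8.0], [8.0, 16.0]])
B = quantize(G, Q)
G_hat = dequantize(B, Q)

# Hypothetical two-stage residual VQ with tiny hand-written codebooks.
codebooks = [
    np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]]),
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]),
]
indices, x_hat = residual_vq([4.5, 5.0], codebooks)
```

In the first example the small coefficients collapse to zero, which is where most of the bit savings come from. In the second, the first stage captures the coarse position and the second stage refines it.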

## Integrating compression and learning

Consider JPEG, where DCT coefficients are generally quantized to between 2 and 5 bits per pixel, whereas the standard training approach represents all inputs using 32-bit floating point values.

It would be nice if we could simply represent 2x2 or perhaps even 4x4 blocks using a single 32-bit floating point number, thus reducing the size of the input layer by an order of magnitude, and simplifying training. However, this would amount to encoding blocks as a categorical variable. Since the standard approach to learning a categorical variable is to one-hot encode, we are back to where we started!
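To make the cost of one-hot encoding concrete, the sketch below counts the categories needed when a whole block is treated as a single categorical variable. The value of 3 bits per pixel is an assumption picked from the 2 to 5 bit range mentioned above, and `one_hot_width` is a hypothetical helper name.

```python
def one_hot_width(block_side, bits_per_pixel):
    """Number of categories when a whole block is one categorical variable."""
    bits_per_block = block_side * block_side * bits_per_pixel
    return 2 ** bits_per_block

for side in (2, 4):
    pixels = side * side
    width = one_hot_width(side, bits_per_pixel=3)
    print(f"{side}x{side} block: {pixels} float inputs vs. one-hot width {width}")
```

Even for a 2x2 block the one-hot vector is far wider than the four floating point inputs it replaces, and for 4x4 blocks it is intractable.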

Recently, approaches to training with lower precision representations have been explored, but it is unlikely that anything fewer than 8 bits can be used without drastically changing the training procedure.

### Post training quantization and quantization aware training

The demand to perform inference of deep learning models on mobile and embedded devices has resulted in many techniques for model quantization. With a few clever techniques, it is possible to quantize the weights of a traditionally trained network to low precision (e.g. 8 bits) and retain good performance. To quantize weights even further, to 4 or fewer bits, techniques for quantization-aware training have been proposed. Unfortunately, the term "quantization-aware" refers only to the fact that the network will eventually be quantized, not to the fact that the training inputs are quantized. Thus, these techniques are not immediately applicable to reducing the cost of training by exploiting the quantization of encoded inputs.
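As a rough illustration of post-training weight quantization, the sketch below applies symmetric per-tensor 8-bit quantization, which is one simple scheme among many and is not necessarily what any particular framework does. It assumes NumPy; the random matrix stands in for trained weights, and the function names are hypothetical.

```python
import numpy as np

def quantize_weights_int8(w):
    """Symmetric per-tensor quantization: w ~= scale * q, q in [-127, 127]."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_weights(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Stand-in for the weights of a traditionally trained layer.
w = rng.normal(scale=0.1, size=(64, 32)).astype(np.float32)

q, scale = quantize_weights_int8(w)
w_hat = dequantize_weights(q, scale)

# The rounding error is bounded by half a quantization step.
max_error = float(np.max(np.abs(w - w_hat)))
print(f"scale={scale:.6f}, max error={max_error:.6f}")
```

Note that only the weights are quantized here; the inputs to the layer are still full precision, which is exactly the limitation discussed above.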

### Binary neural networks

Binary neural networks (BNNs) are a more drastic adaptation aimed at reducing model size. One limitation of BNNs is that the input layer must typically be represented at full precision, while the remaining signal path can be represented using binary inputs, weights, and activations.
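To show why binary layers are cheap, the sketch below computes a dot product between two $\{-1, +1\}$ vectors using XNOR and a population count, a formulation commonly used for 1-bit multiply-accumulates. It is an assumption that a given BNN implementation evaluates its layers this way; the helper names are hypothetical and the example uses only the Python standard library.

```python
def binarize(values):
    """Map real values to {-1, +1} by sign (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]

def pack_bits(signs):
    """Pack a {-1, +1} vector into an int, one bit per element (+1 -> 1)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two packed {-1, +1} vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask  # bit is 1 where the signs match
    matches = bin(agree).count("1")
    # Each match contributes +1 and each mismatch contributes -1.
    return 2 * matches - n

activations = binarize([0.3, -1.2, 0.0, 2.5, -0.1])
weights = binarize([-0.7, -0.4, 0.9, 1.1, 0.2])

packed = binary_dot(pack_bits(activations), pack_bits(weights), n=5)
reference = sum(a * w for a, w in zip(activations, weights))
```

Here `packed` and `reference` agree, but the packed version needs only bitwise operations, whereas a full-precision input layer needs a floating point multiply for every weight.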

With the traditional training setup, representing groups of low-precision inputs as a categorical variable provides no advantage, since we would have to expand it anyway (e.g. using one-hot encoding). However, with a binary neural network, this expansion costs less, because it allows us to reduce the input layer from full precision to binary.
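The expansion described above can be sketched as follows: each categorical code becomes a short vector of binary inputs. The text does not say which expansion is used, so the one-hot mapping to $\{-1, +1\}$ below is only one plausible choice, and the function names and example codes are hypothetical.

```python
def one_hot_signs(code, num_codes=16):
    """Expand a categorical code into a {-1, +1} vector with a single +1."""
    if not 0 <= code < num_codes:
        raise ValueError("code out of range")
    return [1 if k == code else -1 for k in range(num_codes)]

def encode_codes(codes, num_codes=16):
    """Concatenate the one-hot expansions of a sequence of block codes."""
    inputs = []
    for code in codes:
        inputs.extend(one_hot_signs(code, num_codes))
    return inputs

codes = [3, 0, 15]  # hypothetical 4-bit codes for three blocks
x = encode_codes(codes)
print(len(x), "binary inputs for", len(codes), "blocks")
```

Every element of `x` fits in one bit, so the first layer can consume it with binary weights instead of full-precision ones.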

## Experiment: 4x4 image blocks encoded to 4 bits as input to BNN

In this experiment, we encode 4x4 blocks of MNIST digit images to 4-bit codes, a 128x reduction in representation size relative to 32-bit floating point pixels (16 pixels x 32 bits = 512 bits per block). We train two BNNs:

* The first network, which will act as the control in the experiment, is trained on the original MNIST images and the input layer is represented using full precision.
* The second network is trained on lossy 4-bit codes and the input layer is binary.

We will use a VQ-VAE trained on MNIST as our "transform + quantization" lossy compression algorithm. We use four latent dimensions and allow 16 possible codebook vectors (i.e. 4 bit codes) in our VQ-VAE to represent each 4x4 image block.
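The data flow of this encoding step can be sketched as below, assuming NumPy. The sketch only illustrates the shapes involved: the random linear `encoder` and random `codebook` are placeholders, not the trained VQ-VAE, and the random image stands in for an MNIST digit.

```python
import numpy as np

BLOCK, LATENT_DIM, NUM_CODES = 4, 4, 16

def to_blocks(image, block=BLOCK):
    """Split an (H, W) image into flattened, non-overlapping block x block patches."""
    h, w = image.shape
    patches = image.reshape(h // block, block, w // block, block)
    return patches.transpose(0, 2, 1, 3).reshape(-1, block * block)

def vq_encode(latents, codebook):
    """Index of the nearest codebook vector (squared Euclidean distance)."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
image = rng.random((28, 28)).astype(np.float32)         # stand-in for an MNIST digit
encoder = rng.normal(size=(BLOCK * BLOCK, LATENT_DIM))  # placeholder, NOT a trained encoder
codebook = rng.normal(size=(NUM_CODES, LATENT_DIM))     # placeholder, NOT a learned codebook

blocks = to_blocks(image)                      # shape (49, 16): 7x7 blocks of 16 pixels
codes = vq_encode(blocks @ encoder, codebook)  # shape (49,): one 4-bit code per block

bits_in = BLOCK * BLOCK * 32                   # 16 float32 pixels per block
bits_out = (NUM_CODES - 1).bit_length()        # 4 bits index 16 codebook vectors
ratio = bits_in // bits_out
print(f"{len(codes)} codes per image, {ratio}x smaller than float32 pixels")
```

With a trained model, the linear projection would be replaced by the VQ-VAE encoder and the codebook by its learned embedding vectors; the block splitting and nearest-neighbor lookup stay the same.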

### Results

The baseline network reaches about 98% accuracy after 10 epochs and requires 2,599,552 1-bit MACs and 194,688 32-bit MACs.

The network with binary inputs reaches only 85% accuracy after 10 epochs, but requires just 4,736 1-bit MACs and 12,544 32-bit MACs.

The number of model parameters is reduced from 93,000 to 17,000, and the overall model size is reduced from 365 KiB to 68 KiB.
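For a quick comparison, the reduction factors implied by the figures above can be computed directly. The snippet uses only the numbers reported in this section; the dictionary keys are just illustrative labels.

```python
baseline = {
    "1-bit MACs": 2_599_552,
    "32-bit MACs": 194_688,
    "parameters": 93_000,
    "size (KiB)": 365,
}
binary_input = {
    "1-bit MACs": 4_736,
    "32-bit MACs": 12_544,
    "parameters": 17_000,
    "size (KiB)": 68,
}

# Ratio of baseline cost to binary-input cost for each reported quantity.
reductions = {key: baseline[key] / binary_input[key] for key in baseline}

for key, factor in reductions.items():
    print(f"{key}: {factor:.1f}x reduction")
```

The savings are largest in the 1-bit MACs and more modest in the 32-bit MACs, parameter count, and model size, and they come at the cost of the accuracy drop from about 98% to 85%.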
