# Transfer Learning from Lossy Codecs
[Slides](https://danjacobellis.github.io/SYSML/progress.slides.html)

<script>
    document.querySelector('head').innerHTML += '<style>.slides { zoom: 1.75 !important; }</style>';
</script>

<center> <h1>
Transfer Learning from Lossy Codecs
</h1> </center>

&nbsp;

<center> <h2>
Dan Jacobellis
</h2> </center>

## Approaches to neural compression

### Transformer

* [Paper: Variable-Rate Deep Image Compression With Vision Transformers](https://ieeexplore.ieee.org/abstract/document/9770776)

### Optimize parameters of a nonlinear transform code
* [Paper: End-to-end optimized image compression](http://www.cns.nyu.edu/pub/eero/balle17a-final.pdf)
  * [Code example (tensorflow documentation)](https://www.tensorflow.org/tutorials/generative/data_compression)
* [Paper: Neural Data-Dependent Transform for Learned Image Compression](https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Neural_Data-Dependent_Transform_for_Learned_Image_Compression_CVPR_2022_paper.html)
  * [Code example with pretrained model](https://github.com/Dezhao-Wang/Neural-Syntax-Code)

### Vector-quantized variational autoencoder
* [Paper: Neural Discrete Representation Learning](https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html)
  * [Code example (keras documentation)](https://keras.io/examples/generative/vq_vae/)
  * [Standardized codec for speech and music: "Encodec"](https://github.com/facebookresearch/encodec)
  * [Dan's slides on VQ-VAE](https://danjacobellis.net/ITML/discrete_representation_learning.html)

### RNN-based generative model of speech
* [Paper: Generative speech coding with predictive variance regularization](https://ieeexplore.ieee.org/abstract/document/9415120)
  * [Standardized codec for speech only: "Lyra"](https://github.com/google/lyra)

### Conditional GAN for images

* [Paper: High-Fidelity Generative Image Compression](https://proceedings.neurips.cc/paper/2020/hash/8a50bae297807da9e97722a0b3fd8f27-Abstract.html)
  * [Code example with pretrained model](https://github.com/tensorflow/compression/tree/master/models/hific)

## Neural network structures for learning from quantized data

### Binary Neural Networks

* [Larq: Library for implementing BNNs](https://docs.larq.dev/larq/)
* [Dan's slides](https://danjacobellis.net/ITML/lossy_learning.slides.html#/)

### One-hot encode, then exploit sparsity
* [Paper: Learning on tree architectures outperforms a convolutional feedforward network](https://www.nature.com/articles/s41598-023-27986-6)
  * [Code example for CIFAR](https://github.com/yuval-meir/Tree-3)

## Transfer learning / Self-supervised learning

### Conventional transfer learning from pretrained MobileNet

* [MobileNet V2 (images)](https://www.tensorflow.org/tutorials/images/transfer_learning)

* [YAMNet (audio)](https://www.tensorflow.org/tutorials/audio/transfer_learning_audio)

### Self-supervised learning

* [wav2vec](https://www.isca-speech.org/archive_v0/Interspeech_2019/abstracts/1873.html)
  * [wav2vec 2.0 on github](https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md)
  

## Datasets and models for experiments

### Images

* ImageNet-1k and MobileNet variants
  * MobileNet models allow a tradeoff between model complexity and accuracy
  * Two main hyperparameters: width multiplier and resolution multiplier
    * Width multiplier $\alpha \in (0,1]$ scales the number of channels at each layer. Computational cost scales roughly as $\alpha^2$
    * Resolution multiplier $\rho \in (0,1]$ scales the input resolution, and therefore the spatial size of every feature map. Computational cost scales as $\rho^2$
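
The effect of the two multipliers can be checked with the per-layer cost model of a depthwise separable convolution, $D_K^2 \cdot \alpha M \cdot (\rho D_F)^2 + \alpha M \cdot \alpha N \cdot (\rho D_F)^2$, as given in the MobileNet v1 paper. The sketch below is only an illustration: the function name and the example layer sizes (3x3 kernel, 512 input and output channels, 14x14 feature map) are assumptions chosen for the example, not values taken from these slides.

```python
def separable_conv_madds(kernel, in_ch, out_ch, feat, alpha=1.0, rho=1.0):
    """Multiply-adds for one depthwise separable layer (hypothetical helper).

    kernel: spatial kernel size D_K
    in_ch, out_ch: channel counts M and N before scaling
    feat: feature map side length D_F before scaling
    """
    m = alpha * in_ch          # scaled input channels
    n = alpha * out_ch         # scaled output channels
    f = rho * feat             # scaled feature map side length
    depthwise = kernel * kernel * m * f * f   # linear in alpha
    pointwise = m * n * f * f                 # quadratic in alpha
    return depthwise + pointwise


# Example layer sizes are assumptions for illustration only.
base = separable_conv_madds(3, 512, 512, 14)
half_width = separable_conv_madds(3, 512, 512, 14, alpha=0.5)
half_res = separable_conv_madds(3, 512, 512, 14, rho=0.5)

print(half_width / base)  # about 0.254: slightly above alpha**2
print(half_res / base)    # 0.25: exactly rho**2
```

The width ratio is slightly above $\alpha^2$ because the depthwise term is only linear in $\alpha$. The pointwise term dominates for layers with many channels, which is why the cost is usually quoted as quadratic.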

![](img/top1_vs_MAdd.png)

![](img/top1_vs_latency.png)

[source](https://openaccess.thecvf.com/content_ICCV_2019/html/Howard_Searching_for_MobileNetV3_ICCV_2019_paper.html)

### Audio

* Current audio models use CNNs similar to those used for images, applied to a time-frequency representation of the audio

* [Pretrained model: YAMNet](https://tfhub.dev/google/yamnet/1)
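  * [GitHub repo describing pipeline](https://github.com/tensorflow/models/tree/master/research/audioset/yamnet)
    * Audio is resampled to 16 kHz mono
    * A time-frequency transform is applied: a log-mel spectrogram with a 25 ms window, 10 ms hop, and 64 mel bands
    * Network input size is 96x64 (96 frames by 64 mel bands, about 0.96 s of audio)
  * [AudioSet](https://research.google.com/audioset/)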
    * [Subset consisting of 521 classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv)
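
The arithmetic behind the 96x64 input can be sketched as follows. This assumes the framing parameters described in the linked YAMNet repository (25 ms window, 10 ms hop, 64 mel bands, 0.96 s patches with a 0.48 s hop), and it ignores any padding the real model applies, so the counts are approximate. All constant and function names here are hypothetical.

```python
# Assumed YAMNet-style framing parameters (see the linked repository).
SAMPLE_RATE = 16_000   # Hz, after resampling
WINDOW = 400           # STFT window: 25 ms at 16 kHz
HOP = 160              # STFT hop: 10 ms at 16 kHz
MEL_BANDS = 64         # mel bins per spectrogram frame
PATCH_FRAMES = 96      # frames per network input (0.96 s)
PATCH_HOP = 48         # frames between patch starts (0.48 s)


def num_frames(num_samples):
    """Number of full STFT windows that fit in the waveform (no padding)."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // HOP


def num_patches(num_samples):
    """Number of full 96-frame patches available from the waveform."""
    frames = num_frames(num_samples)
    if frames < PATCH_FRAMES:
        return 0
    return 1 + (frames - PATCH_FRAMES) // PATCH_HOP


clip = 10 * SAMPLE_RATE                 # a 10 second clip
print(num_frames(clip))                 # 998 spectrogram frames
print(num_patches(clip))                # 19 patches
print((PATCH_FRAMES, MEL_BANDS))        # (96, 64) input per patch
```

Each patch is classified independently, so a 10 second clip yields a sequence of predictions rather than a single one.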


