Proposal on QAT LLM quantization
From the information-theoretic point of view an LLM is a discrete object. Treating quantization noise as channel error, the model can be quantized almost without degradation as long as the LLM's capacity as a channel is higher than the data rate. The problem is that weight optimization for a discrete object has exponential complexity, while SGD-based training methods, which treat the LLM as continuous, have polynomial complexity. Continuity, however, implies infinite accuracy and therefore redundant bits in the LLM's FP weights and activations. One approach is to simply inject quantization noise after training (PTQ), but this always distorts activation distributions and reduces quality. PTQ operates directly on LLM weights and activations with little or no training data, so its complexity is low.
The discrete-continuous contradiction can be resolved by treating the discrete (quantized) LLM as a continuous object with noise. By analogy with Shannon's theorem on AWGN channel capacity, the SNR then defines the effective channel bit-width. SGD can still be applied to the continuous part of the LLM, which justifies QAT methods.
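As a rough, hedged illustration of this analogy (not part of the original proposal), the sketch below uniformly quantizes a weight tensor, measures the resulting quantization SNR, and converts it into an effective bit-width via the AWGN capacity formula 0.5 · log2(1 + SNR); the function name and parameters are hypothetical.

```python
import math
import torch

def quantization_snr_bits(w: torch.Tensor, bits: int = 4):
    """Uniformly quantize a tensor and report the resulting SNR and the
    effective bit-width implied by the AWGN capacity 0.5 * log2(1 + SNR)."""
    scale = w.abs().max() / (2 ** (bits - 1) - 1)     # symmetric per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale
    noise_power = (w - w_q).pow(2).mean()
    signal_power = w.pow(2).mean()
    snr = (signal_power / noise_power).item()
    effective_bits = 0.5 * math.log2(1 + snr)         # capacity per real dimension
    return snr, effective_bits

# Example: 4-bit quantization of a random Gaussian weight matrix.
snr, eff_bits = quantization_snr_bits(torch.randn(256, 256), bits=4)
print(f"SNR = {snr:.1f}, effective bit-width ~ {eff_bits:.2f}")
```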
QAT can be fixed-width or mixed-width. In the latter case an optimal bit-width is selected for each channel based on quality. Mixed-width QAT also covers pruning, since a pruned channel corresponds to zero-bit quantization.
Unfortunately, the T in QAT stands for training. LLM training is prohibitive both in compute and in the amount and availability of training data. For practical purposes it is therefore necessary to develop data-free QAT with complexity similar to PTQ.
- LLM PTQ
  - Pro: PTQ needs neither training nor training data.
  - Contra: PTQ reduces NN quality because quantization error distorts activation distributions.
- Bit-width differentiable CNN QAT with random noise injection (see the sketch after this list)
  - Pro: QAT can recover quality if the NN capacity is sufficient; it enables mixed-precision quantization.
  - Contra: QAT is as slow as or slower than FP training, which is prohibitive for LLMs, and training data is needed.
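Below is a minimal sketch of how noise-injection QAT can be made bit-width differentiable, assuming PyTorch: during training the quantizer adds uniform noise with the magnitude of one quantization step (a differentiable proxy for rounding), while at inference it rounds for real. The module and its names are my own illustration, not code from the proposal.

```python
import torch
import torch.nn as nn

class NoisyFakeQuant(nn.Module):
    """Simulated quantizer: adds uniform noise of one quantization step during
    training (a differentiable proxy for rounding), rounds for real at eval time."""
    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        # Continuous bit-width parameter, so it can be optimized by SGD.
        self.bits = nn.Parameter(torch.tensor(init_bits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n_steps = 2.0 ** self.bits - 1.0                  # quantization steps over [-max, max]
        step = 2.0 * x.detach().abs().max() / n_steps     # quantization step size
        if self.training:
            noise = (torch.rand_like(x) - 0.5) * step     # U(-step/2, step/2): same power as rounding error
            return x + noise                              # differentiable in x and in self.bits (via step)
        return torch.round(x / step) * step               # hard rounding at inference

# Usage sketch: wrap activations (or weights) of a layer.
fq = NoisyFakeQuant(init_bits=4.0)
fq.train()
y = fq(torch.randn(8, 16))
```

In practice a bit-budget penalty on `self.bits` would be added to the task loss so that SGD trades quality against bit-width.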
Data-free fast QAT.
- Identify a small LLM which can be trained locally
  - Several transformer-based models can be chosen:
    - MobileBERT (25.3M)
    - MobileBERT_tiny (15.1M)
  - Note that most papers provide benchmarks for fairly large models, such as LLaMA 7B, etc.
- Set up the inference and training pipeline
- Reproduce SOTA training quality (define what counts as SOTA)
- Integrate per-channel mixed-precision QAT into LLM training (see the first sketch after this plan)
  - Different bit-widths may be considered for different parts of the transformer:
    - Embedding layers
    - FFN (perceptron) layers
    - Multi-head attention layers
  - Quantize the LLM with various bit-width targets
  - Compare quality with PTQ
- Implement random text generation with the pretrained FP LLM by sampling from the generator's input distribution with an empty prompt
- Implement data-free LLM distillation using the quantized network as the student (see the second sketch after this plan)
- Integrate distillation and QAT
- Compare results with stage 1 QAT
- Quantize the LLM in tiles of 3 layers, one tile at a time starting from the input; use the FP LLM tile output as the reference with an L1/L2 distance (Hinton-like forward learning, see the third sketch after this plan)
- Integrate forward tiled QAT with stage 2 distillation
- Compare results with stage 1 and 2 QAT
- Evaluate on larger LLMs
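The first sketch referenced in the plan illustrates per-channel mixed-precision fake quantization: every output channel of a linear layer gets its own scale and its own bit-width, and a zero bit-width corresponds to pruning the channel. This is an illustration under assumed PyTorch conventions, not an implementation from the proposal; all names are hypothetical.

```python
import torch
import torch.nn as nn

class PerChannelFakeQuantLinear(nn.Module):
    """Linear layer whose weight rows (output channels) are fake-quantized
    with individual scales and individual bit-widths."""
    def __init__(self, linear: nn.Linear, bits_per_channel: torch.Tensor):
        super().__init__()
        self.linear = linear
        # One bit-width per output channel; 0 bits means the channel is pruned.
        self.register_buffer("bits", bits_per_channel.float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.linear.weight                                  # [out_features, in_features]
        levels = (2.0 ** (self.bits - 1) - 1).clamp(min=1.0)    # symmetric signed levels per row
        scale = (w.detach().abs().amax(dim=1) / levels).unsqueeze(1)  # per-channel scale
        # Straight-through estimator: round in the forward pass, identity in the backward pass.
        w_q = w + (torch.round(w / scale) * scale - w).detach()
        w_q = torch.where(self.bits.unsqueeze(1) > 0, w_q, torch.zeros_like(w_q))  # 0-bit == pruned
        return nn.functional.linear(x, w_q, self.linear.bias)

# Usage sketch: 4 bits for half of the channels, 8 bits for the rest.
layer = nn.Linear(64, 32)
bits = torch.cat([torch.full((16,), 4), torch.full((16,), 8)])
qlayer = PerChannelFakeQuantLinear(layer, bits)
y = qlayer(torch.randn(2, 64))
```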
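The second sketch combines the random-generation and data-free distillation steps: training token sequences are drawn at random from the vocabulary (or could be generated by the FP teacher from an empty prompt), and the quantized student is trained to match the teacher's logits with a KL-divergence loss. Model loading follows the usual Hugging Face transformers conventions; the model name and hyperparameters are placeholders, and in practice the student's linear layers would be wrapped with fake-quant modules such as the one above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model_name = "gpt2"  # placeholder; any small causal LM works for the experiment
teacher = AutoModelForCausalLM.from_pretrained(model_name).eval()
student = AutoModelForCausalLM.from_pretrained(model_name)  # to be wrapped with fake-quant layers

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

for step in range(100):
    # Data-free "training set": random token ids sampled from the vocabulary.
    # Alternative: let the teacher generate sequences from an empty/BOS-only prompt.
    input_ids = torch.randint(0, teacher.config.vocab_size, (4, 128))

    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits

    # KL divergence between teacher and student token distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```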
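The third sketch is a rough take on the tiled forward scheme: the quantized model is optimized three blocks at a time, with the corresponding FP blocks providing the reference output and an L2 distance as the loss. Blocks are assumed to map a hidden-state tensor to a hidden-state tensor; architectures whose blocks return tuples would need a small adapter.

```python
import torch
import torch.nn as nn

def tiled_forward_qat(fp_blocks, q_blocks, hidden_states, tile_size=3, steps=50, lr=1e-4):
    """Optimize the quantized model tile by tile (tile_size blocks at a time),
    matching each quantized tile's output to the FP tile's output with an L2 loss."""
    assert len(fp_blocks) == len(q_blocks)
    fp_h, q_h = hidden_states, hidden_states.clone()

    for start in range(0, len(fp_blocks), tile_size):
        fp_tile = fp_blocks[start:start + tile_size]
        q_tile = q_blocks[start:start + tile_size]

        # Reference output of the FP tile for the current input.
        with torch.no_grad():
            fp_out = fp_h
            for blk in fp_tile:
                fp_out = blk(fp_out)

        # Train only the current quantized tile against the FP reference.
        opt = torch.optim.Adam([p for blk in q_tile for p in blk.parameters()], lr=lr)
        for _ in range(steps):
            q_out = q_h
            for blk in q_tile:
                q_out = blk(q_out)
            loss = nn.functional.mse_loss(q_out, fp_out)   # L2 distance; L1 would use l1_loss
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Advance both streams to the input of the next tile.
        fp_h = fp_out
        with torch.no_grad():
            for blk in q_tile:
                q_h = blk(q_h)
    return q_h
```

Each tile only ever sees inputs already produced by the quantized prefix of the network, which is what makes the procedure forward-only.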
In a CNN each channel has an individual scale, and the weights within a channel are of similar magnitude. This resembles block quantization, where each block has its own scale. It therefore seems natural to identify LLM channels with the blocks used for quantization.
A transformer consists of several types of layers with weights. Most of them are Linear (fully-connected) layers composed in different ways.
- Embedding
  - https://arxiv.org/abs/2109.12948 introduced PEG (per-embedding-group quantization) for embedding activations
- FFN, Multi-Head Attention
  - Several quantization techniques exist for these layers (see the sketch below):
    - per-row quantization: each row has its own scale factor
    - group-wise quantization: used in LUT-GEMM; a varying number of scale factors per group within a layer channel
    - block-wise quantization: used in GGUF/GGML; similar to group-wise quantization, but blocks do not respect channel boundaries
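To make the granularity options concrete, here is a small sketch (my own illustration, not from the cited works) that computes symmetric int4 scale factors for the same weight matrix at per-row, group-wise, and block-wise granularity; the function name and the group/block sizes are arbitrary placeholders.

```python
import torch

def int4_scales(w: torch.Tensor, group_size=None, respect_rows: bool = True):
    """Symmetric int4 scale factors for a weight matrix at different granularities.

    - group_size=None, respect_rows=True  -> per-row quantization (one scale per row)
    - group_size=G,    respect_rows=True  -> group-wise quantization (G weights per scale, within a row)
    - group_size=G,    respect_rows=False -> block-wise quantization (G weights per scale, rows ignored)
    """
    qmax = 7  # symmetric signed int4: representable values in [-8, 7]
    if group_size is None:
        return w.abs().amax(dim=1) / qmax                 # shape: [rows]
    if respect_rows:
        g = w.reshape(w.shape[0], -1, group_size)         # [rows, groups_per_row, group_size]
    else:
        g = w.reshape(-1, group_size)                     # [total_blocks, group_size], GGUF/GGML-style
    return g.abs().amax(dim=-1) / qmax

w = torch.randn(128, 512)
print(int4_scales(w).shape)                                     # per-row:    torch.Size([128])
print(int4_scales(w, group_size=128).shape)                     # group-wise: torch.Size([128, 4])
print(int4_scales(w, group_size=32, respect_rows=False).shape)  # block-wise: torch.Size([2048])
```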