GlowQ is a low-rank correction method for quantized LLMs that reduces the latency and memory overhead of conventional per-layer restoration by sharing and caching a single right-factor projection across modules that consume the same input (e.g., QKV or MLP groups).
GlowQ-S is a selective variant that applies these cached shared corrections only to the groups/layers with the highest accuracy benefit, preserving most of the quality gains while further improving inference efficiency.
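Concretely, for a group of modules that consume the same input x (such as the Q, K, and V projections), each module's correction factorizes as U_i (V x) with one right factor V shared by the whole group, so the rank-r projection V x is computed once and reused. A conceptual PyTorch sketch of this idea (dimensions, names, and the dense stand-ins for quantized weights are illustrative, not GlowQ's actual module code):

```python
# Group-shared low-rank correction: module i outputs Wq_i @ x + U_i @ (V @ x),
# where V is shared across the group, so V @ x is computed once and cached.
import torch

d, r = 1024, 32                  # hidden size and correction rank (illustrative)
x = torch.randn(d)               # input shared by the Q/K/V group
V = torch.randn(r, d)            # shared right factor for the group
modules = {                      # per-module (stand-in) quantized weight + left factor
    name: (torch.randn(d, d), torch.randn(d, r)) for name in ("q", "k", "v")
}

vx = V @ x                       # one rank-r projection for the whole group
outputs = {name: Wq @ x + U @ vx for name, (Wq, U) in modules.items()}
print({name: tuple(out.shape) for name, out in outputs.items()})
```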
Anaconda/Miniconda is recommended.
```bash
cd GlowQ
conda env create -f environment.yml
conda activate glowq
python -m pip install --upgrade pip
```

Install a PyTorch build that matches your CUDA runtime/driver environment.
Example (pip, CUDA 12.1 wheels):
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

If you only want a quick CPU-only smoke test (not recommended for running the full pipeline):

```bash
pip install torch torchvision torchaudio
```

`environment.yml` installs the non-PyTorch dependencies used by the GlowQ pipelines (Transformers, Datasets, plotting, Triton, lm-eval, etc.).
PyTorch is intentionally installed separately in Step 2 so you can choose a CUDA-matching build.
Use a CUDA-enabled PyTorch wheel (as above). This is usually enough to run the main pipeline without installing a full system CUDA toolkit manually.
If you enable the custom CUDA extension path (e.g., use_cuda_w4a16 = true), you will need to build the CUDA extension (recommended) or allow runtime JIT fallback. In either case, install a matching CUDA Toolkit on the system.
Checklist:
- NVIDIA driver is installed and `nvidia-smi` works.
- `torch.cuda.is_available()` returns `True`.
- `nvcc --version` is available (for CUDA extension builds).
- CUDA Toolkit version is compatible with the PyTorch CUDA build you installed.
Useful checks:
```bash
nvidia-smi
nvcc --version
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("torch CUDA build:", torch.version.cuda)
PY
```

If the CUDA extension build fails, set `CUDA_HOME` to your toolkit path (example):
```bash
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
```

GlowQ's W4A16 path uses a custom CUDA extension in `src/cuda_w4a16/`.
The current default is to import a prebuilt extension (recommended), not to JIT-build it at runtime.
Build once (or rebuild after editing CUDA sources):
```bash
cd GlowQ/src
# Set CUDA arch explicitly if auto-detection is unreliable on your server
# Example for A100: 8.0
export TORCH_CUDA_ARCH_LIST=8.0
python setup_cuda_w4a16.py build_ext --inplace
```

This produces a compiled module under:

```
GlowQ/src/cuda_w4a16/w4a16_kernels*.so
```
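If you are unsure which value to pass for `TORCH_CUDA_ARCH_LIST`, or want to confirm the build is importable, a quick check along these lines can help. The import path is an assumption based on the `.so` location above; adjust it to your checkout layout.

```python
# Print GPU 0's compute capability (e.g. (8, 0) on A100, i.e.
# TORCH_CUDA_ARCH_LIST=8.0), then try importing the built extension.
import sys
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"suggested TORCH_CUDA_ARCH_LIST={major}.{minor}")

sys.path.insert(0, "GlowQ/src")  # assumed checkout location
try:
    from cuda_w4a16 import w4a16_kernels  # assumed import path
    print("extension loaded from:", w4a16_kernels.__file__)
except ImportError as e:
    print("extension not importable:", e)
```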
Optional build knobs:
```bash
export MAX_JOBS=8                 # parallel compile jobs (ninja)
export CUDA_HOME=/usr/local/cuda
```

To enable the kernel path:

- Build the extension (above).
- In your config (`.toml`), set `use_cuda_w4a16 = true`.
- Run the pipeline as usual (for example, `python run_glowq.py qwen_2_5_7b.toml`).
Expected Step3 log when enabled:
```
Converting model to CUDA W4A16...
```
If import/build fails:
- Confirm `nvcc --version` works and matches your CUDA/PyTorch environment.
- Confirm `CUDA_HOME` is set correctly (see the example above).
- Rebuild after CUDA source changes: `python setup_cuda_w4a16.py build_ext --inplace`
If you intentionally want the old runtime JIT fallback path:
```bash
export W4A16_ALLOW_JIT=1
```

Then the extension can fall back to `torch.utils.cpp_extension.load(...)` when the prebuilt module is missing.
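For reference, here is a minimal sketch of what such a fallback looks like with PyTorch's JIT-build API. The module name and source file paths are assumptions for illustration; GlowQ's actual fallback logic may differ.

```python
# Prefer the prebuilt module; JIT-compile only when explicitly allowed.
import os
from torch.utils.cpp_extension import load

def load_w4a16():
    try:
        from cuda_w4a16 import w4a16_kernels  # prebuilt path (preferred)
        return w4a16_kernels
    except ImportError:
        if os.environ.get("W4A16_ALLOW_JIT") != "1":
            raise
        # Compiled on first use and cached by PyTorch afterwards.
        return load(
            name="w4a16_kernels",
            sources=[
                "src/cuda_w4a16/w4a16_kernels.cpp",  # assumed file name
                "src/cuda_w4a16/w4a16_kernels.cu",   # assumed file name
            ],
            verbose=True,
        )
```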
GlowQ currently provides two pipeline entry points:
- `run_glowq.py`: main GlowQ pipeline (step1 -> step2 -> step3)
- `run_glowq_s.py`: restoration pipeline (step1 -> step2 -> step3_1 -> step4 -> step5)

Both scripts take one argument, `CONFIG` (a TOML file path or a file name under `./configs`).

`run_glowq.py` executes:
- Step1: quantization error extraction
- Step2: randomized GSVD / shared low-rank artifact generation
- Step3: evaluation (`step3_eval_dataset.py` or `step3_lm_eval.py`)
Run with:
```bash
python run_glowq.py configs/qwen_2_5_7b.toml
```

You can also pass only the config filename:

```bash
python run_glowq.py qwen_2_5_7b.toml
```

`run_glowq_s.py` executes:
- Step1: restoration quantization error extraction
- Step2: restoration randomized GSVD
- Step3_1: importance ranking computation
- Step4: cumulative restoration evaluation
- Step5: final comparison plot generation
Run with:
```bash
python run_glowq_s.py configs/qwen_2_5_7b.toml
```

Recent update:

- `step3_1` importance ranking now supports configurable metrics from TOML via `importance_metric`.
- Default metrics are `gsvd,norm_error` (GSVD score + normalized error ratio).
Config templates are in `./configs/`. Examples:

- `configs/qwen_2_5_7b.toml`
- `configs/llama_3_2_3b.toml`
- `configs/mistral_7b.toml`
Typical fields include:
- `model_name`
- `svd_rank`
- `calibration_dataset`
- `calibration_n_samples`
- `ppl_dataset`
- `lm_harness`
- `device`
- `group_size`
- `use_cuda_w4a16`
- `trust_remote_code`
- `output_dir`
- `glowq_s` (optional; marker flag for restoration pipeline configs)
- `importance_metric` (optional; used by `run_glowq_s.py` Step3_1)
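To sanity-check a config before a long run, you can load it and print the common fields. A minimal sketch, assuming Python 3.11+ for `tomllib` (use the third-party `tomli` package on older interpreters):

```python
# Load a GlowQ TOML config and print the fields listed above.
import tomllib

with open("configs/qwen_2_5_7b.toml", "rb") as f:
    cfg = tomllib.load(f)

for key in ("model_name", "svd_rank", "calibration_dataset", "device",
            "group_size", "use_cuda_w4a16", "output_dir"):
    print(f"{key} = {cfg.get(key)!r}")
```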
To use the custom CUDA W4A16 kernel path in Step3, set `use_cuda_w4a16 = true` in your config (TOML). If this is `false`, GlowQ uses the default Triton 4-bit path (or an FP16 fallback when Triton is unavailable).
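Conceptually, that backend selection behaves like the following sketch (names are illustrative assumptions, not GlowQ's actual internals):

```python
# Illustrative backend choice: custom CUDA kernels if requested, else the
# Triton 4-bit path, else a dequantize-to-FP16 fallback.
def pick_w4a16_backend(use_cuda_w4a16: bool) -> str:
    if use_cuda_w4a16:
        return "cuda_ext"  # custom CUDA W4A16 extension
    try:
        import triton  # noqa: F401
        return "triton"    # default Triton 4-bit path
    except ImportError:
        return "fp16"      # fallback when Triton is unavailable
```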
`run_glowq_s.py` reads `importance_metric` from the TOML config and passes it to `src/restoration/step3_1_calculate_importance.py`.

Default:

```toml
glowq_s = true
importance_metric = "gsvd,norm_error"
```

Supported metrics (comma-separated):

- `gsvd`
- `norm_error`
- `frobenius_norm_error`
- `cosine_similarity`
- `layer_order`
Useful aliases are also accepted (for example: `norm`, `normalized`, `fro`, `cosine`, `layer`).
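As an illustration of how such alias handling typically looks; this is a sketch only. The canonical names come from the list above, but the exact alias mapping lives in `step3_1_calculate_importance.py` and may differ.

```python
# Hypothetical alias table; resolves a comma-separated metric string to
# canonical metric names.
ALIASES = {
    "norm": "norm_error",
    "normalized": "norm_error",
    "fro": "frobenius_norm_error",
    "cosine": "cosine_similarity",
    "layer": "layer_order",
}

def parse_importance_metric(value: str) -> list[str]:
    """Split on commas, trim whitespace, and map aliases to canonical names."""
    return [ALIASES.get(tok.strip().lower(), tok.strip().lower())
            for tok in value.split(",")]

print(parse_importance_metric("gsvd, norm"))  # ['gsvd', 'norm_error']
```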
If `lm_harness = true` in the config, `run_glowq.py` Step3 uses `src/step3_lm_eval.py`, which requires `lm-eval`.
Default output directory for the main pipeline:

```
GlowQ/outputs/<config_stem>/
```

Typical artifacts:

```
step1/
  quant_error.pt
  original_weights.pt
step2/
  low_rank_shared.pt
  b_ref_map.json
logs/
  step2_rsvd.log
```
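To peek at what Step1/Step2 saved, you can load the artifacts generically. The internal layout of the `.pt` files is not documented here, so treat this as an inspection sketch; the `<config_stem>` value is an example.

```python
# Generic artifact inspection; the .pt payload structure is unknown, so we
# only print its type. weights_only=False is needed for arbitrary pickled
# objects (use it only on files you trust).
import json
import torch

out = "GlowQ/outputs/qwen_2_5_7b"  # assumed <config_stem>

shared = torch.load(f"{out}/step2/low_rank_shared.pt",
                    map_location="cpu", weights_only=False)
print("low_rank_shared.pt payload type:", type(shared))

with open(f"{out}/step2/b_ref_map.json") as f:
    print("b_ref_map entries:", len(json.load(f)))
```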
Default output directory for the restoration pipeline:

```
GlowQ/outputs/<config_stem>/restoration/
```

Typical artifacts:

```
step1/
  quant_error.pt
  original_weights.pt
step2/
  low_rank_shared.pt
  b_ref_map.json
step3_1/
  importance_rankings.json
step4/
  cumulative_results.csv
step5/
  final_ppl_comparison_plot.png
```
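The Step3_1 and Step4 outputs are plain JSON/CSV, so they are easy to inspect directly. Their internal schema is not specified here, so this sketch just prints raw contents; the `<config_stem>` value is an example.

```python
# Dump the restoration-pipeline rankings and cumulative results as-is.
import csv
import json

root = "GlowQ/outputs/qwen_2_5_7b/restoration"  # assumed <config_stem>

with open(f"{root}/step3_1/importance_rankings.json") as f:
    print(json.dumps(json.load(f), indent=2)[:500])  # preview first 500 chars

with open(f"{root}/step4/cumulative_results.csv") as f:
    for row in csv.reader(f):
        print(row)
```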
Quick start:

```bash
conda activate glowq

# Main pipeline
python run_glowq.py qwen_2_5_7b.toml

# Restoration pipeline (optional)
python run_glowq_s.py qwen_2_5_7b.toml
```

Common issues:

- Triton is not installed: install `triton`, or run paths that allow a Triton-disabled fallback.
- lm-eval-harness not available: install `lm-eval` and set `lm_harness = true` only when needed.
- CUDA extension build/runtime issues: verify `nvidia-smi`, `nvcc --version`, and PyTorch CUDA compatibility.
- Hugging Face model loading errors with community models: set `trust_remote_code = true` in the config when required.
Citation:

```bibtex
@inproceedings{
  an2026glowq,
  title={GlowQ: Group-Shared {LO}w-Rank Approximation for Quantized {LLM}s},
  author={Selim An and Il hong Suh and Yeseong Kim},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=kVojSLUcvS}
}
```
