Add LoRA fine-tuning to AWQ #85

RonanKMcGovern · 2023-10-01T14:38:59Z

It would be fantastic if we could add the ability to do LoRA fine-tuning and merging of adapters.

Background on QLoRA

Interestingly, for many fine-tunings, the results of QLoRA are very similar to doing an unquantized LoRA fine-tune. Of course, QLoRA allows for fine-tuning with one third of the VRAM requirement (if doing 4bit).

The two common libraries I use are:

bitsandbytes: This does not allow correct merging of adapters to the dequantized base model.
gptq: also does not allow merging of adapters, plus the perplexity is worse than awq (and bnb in some cases).

Why add LoRA to AWQ

AWQ has the best perplexity and good inference speed
If it were possible to do QLoRA AND merge adapters to the base dequantized model, AWQ would be the best available solution for doing fine-tuning, at least in quantized form.

casper-hansen · 2023-10-01T15:32:58Z

I would love to add LoRA and make AutoAWQ compatible with PEFT. This is something that I have thought about, but currently it’s more important for me to see what I can do for a high-throughput quantized model.

RonanKMcGovern · 2023-10-01T22:36:18Z

Ok cool, I think supporting QLoRA merging is underappreciated though. I don't know of any way to do this and it means there isn't a good open source way to serve QLoRA tuned open source models.

BTW, when you say high throughput, do you mean batch sizes larger than 8? So a bf16 implementation?

s4rduk4r · 2023-10-09T10:28:15Z

I could probably look into it during next week. Maybe the autograd_4bit code from here could be adapted somehow

casper-hansen · 2023-10-09T10:42:15Z

In general, I think we should integrate with PEFT. From my understanding, this requires our WQLinear modules to generate gradients during a backward pass - so you would have to implement that functionality. It may turn out to be easy enough since autograd works pretty well - maybe look to AutoGPTQ to see how they integrated with PEFT.

K024 · 2023-10-10T13:18:26Z

@casper-hansen AutoGPTQ implements QLinear with various underlying QGEMM implementations (cuda, exllama, qigen, openai/triton), and most of them do not implement a backward kernel; only triton does. The triton kernel is currently the only one that can be used for training a quantized model in AutoGPTQ, though it is not the most optimal.

FYI the autograd_4bit mentioned above simply unpacks the weights into fp and calls torch.matmul
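To make the "unpack into fp and matmul" approach concrete, here is a minimal framework-free sketch. It is not the autograd_4bit or AWQ code: the helper names, the sequential nibble order, and the single per-tensor scale and zero point are all assumptions chosen for illustration (real kernels use group-wise scales and their own pack order). The forward pass dequantizes the packed weights and multiplies; the backward pass reuses the same dequantized weights to produce the gradient with respect to the input, while the frozen quantized weights get no gradient.

```python
from typing import List

Matrix = List[List[float]]


def pack_int4(values: List[int]) -> List[int]:
    """Pack 4-bit integers (0..15) into 32-bit words, eight per word, low nibble first.

    The sequential order is an illustrative assumption, not the AWQ pack order.
    """
    assert len(values) % 8 == 0, "need a multiple of eight 4-bit values"
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            assert 0 <= v <= 15
            word |= v << (4 * j)
        words.append(word)
    return words


def unpack_int4(words: List[int]) -> List[int]:
    """Inverse of pack_int4."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]


def dequantize(words: List[int], scale: float, zero: int, rows: int, cols: int) -> Matrix:
    """Rebuild a rows x cols floating-point matrix as (q - zero) * scale."""
    q = unpack_int4(words)
    return [[(q[r * cols + c] - zero) * scale for c in range(cols)] for r in range(rows)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def forward(x: Matrix, words: List[int], scale: float, zero: int, rows: int, cols: int) -> Matrix:
    """y = x @ W, with W dequantized on the fly."""
    return matmul(x, dequantize(words, scale, zero, rows, cols))


def backward(grad_y: Matrix, words: List[int], scale: float, zero: int, rows: int, cols: int) -> Matrix:
    """grad_x = grad_y @ W^T; the packed weights are frozen and receive no gradient."""
    return matmul(grad_y, transpose(dequantize(words, scale, zero, rows, cols)))


# A 2 x 4 weight matrix (in_features x out_features) stored as one packed word.
packed = pack_int4([8, 9, 7, 8, 10, 8, 8, 6])
print(forward([[1.0, 2.0]], packed, 0.5, 8, 2, 4))             # [[2.0, 0.5, -0.5, -2.0]]
print(backward([[1.0, 2.0, 0.0, 0.0]], packed, 0.5, 8, 2, 4))  # [[1.0, 1.0]]
```

Because only the input gradient is needed, this is enough for gradients to flow back to LoRA adapters placed earlier in the network, at the cost of materialising the full-precision weights on every call.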

casper-hansen · 2023-10-10T13:25:39Z

I welcome any work on a backward pass function for AWQ. There are many ways to go about it. Just keep in mind that the AWQ kernel does not scale well with larger batch sizes: above batch size 16, it will be slower than FP16. I found some code where someone implemented the backward pass:

https://github.com/compressa-ai/llm-awq/tree/dev

K024 · 2023-10-10T13:52:55Z

@casper-hansen FYI the above one still unpacks and gemms everything in fp...

~~And I noticed the pack order had changed in the llm-awq repo since Sep 7~~

I see the changes: gemmv2_forward_cuda vs gemm_forward_cuda.

casper-hansen · 2023-10-10T14:57:22Z

Yes, I see that, they dequantize to run FP16. I’m pretty sure this is normal for training?

I created v2 based on their new GEMM kernel but it’s way slower and only compatible with GEMV where it processes the context. GEMV is 20% faster at small prompts but not great for high throughput or deployments.

Ph0rk0z · 2023-10-11T17:07:19Z

ime, triton was never faster for anything. exclusionary high compute requirements and slower speed, oh my.

The only one who has pulled off merging adapters into quantized models is GGUF. With that alpaca_lora_4bit repo + extensions I can merge LoRA together but not to the model.

cassianlewis · 2023-11-08T12:04:28Z

AFAIK you can merge the LoRA weights into the unquantised base model (even if you fine-tuned in 4/8 bit) using model.merge_and_unload(). You can then quantise this model using AWQ and run as normal.

I guess this only really applies if you don't have the VRAM to train the model without PEFT though.

RonanKMcGovern · 2023-11-08T19:48:02Z

@cassianlewis yeah in bnb but not gptq AFAIK.

Not ideal to merge to unquantized either.
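One reading of this concern can be shown with a toy calculation. The sketch below is an assumption-laden illustration, not code from any of the libraries discussed: it uses a simple round-to-nearest 4-bit fake quantizer with one scale, and made-up 2 x 2 weights and rank-1 adapter values. The adapter is trained against the dequantized weights Q(W), so the model seen during training is Q(W) + ΔW; merging into the original weights gives W + ΔW instead, which differs by the quantization error W - Q(W).

```python
from typing import List

Matrix = List[List[float]]


def fake_quantize(w: Matrix, scale: float, zero: int = 8) -> Matrix:
    """Round-to-nearest 4-bit quantization followed by dequantization (illustrative only)."""
    out = []
    for row in w:
        q_row = [min(15, max(0, round(v / scale) + zero)) for v in row]
        out.append([(q - zero) * scale for q in q_row])
    return out


def lora_delta(b: Matrix, a: Matrix, alpha: float, r: int) -> Matrix:
    """Delta W = (alpha / r) * B @ A, with B of shape out x r and A of shape r x in."""
    scaling = alpha / r
    return [
        [scaling * sum(b_ik * a[k][j] for k, b_ik in enumerate(b_row)) for j in range(len(a[0]))]
        for b_row in b
    ]


def add(x: Matrix, y: Matrix) -> Matrix:
    return [[p + q for p, q in zip(xr, yr)] for xr, yr in zip(x, y)]


# Hypothetical values, chosen so that no rounding tie occurs.
w_fp = [[0.30, -0.70], [1.10, 0.20]]           # original 16-bit weights W
w_q = fake_quantize(w_fp, scale=0.25)          # what the adapter saw in training, Q(W)
delta = lora_delta(b=[[1.0], [0.0]], a=[[0.5, 0.5]], alpha=2.0, r=1)

trained_view = add(w_q, delta)                 # Q(W) + Delta W
merged_fp = add(w_fp, delta)                   # W + Delta W

max_gap = max(
    abs(m - t)
    for m_row, t_row in zip(merged_fp, trained_view)
    for m, t in zip(m_row, t_row)
)
print(trained_view)
print(merged_fp)
print(max_gap)
```

The gap is exactly the per-weight quantization error, so it shrinks with a better quantizer but does not vanish; quantizing the merged weights again afterwards adds a second round of rounding on top.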

sd3ntato · 2024-01-16T12:00:09Z

> AFAIK you can merge the LoRA weights into the unquantised base model (even if you fine-tuned in 4/8 bit) using model.merge_and_unload(). You can then quantise this model using AWQ and run as normal.
>
> I guess this only really applies if you don't have the VRAM to train the model without PEFT though.

Hi, I'm trying to do this with Mixtral, but I get the following output and error:

Downloading and preparing dataset json/mit-han-lab--pile-val-backup to /root/.cache/huggingface/datasets/mit-han-lab___json/mit-han-lab--pile-val-backup-39bc257d0ce73de2/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading readme: 100%|██████████| 167/167 [00:00<00:00, 112kB/s]
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading data:  48%|████▊     | 225M/471M [00:01<00:01, 125MB/s]
Extracting data files: 100%|██████████| 1/1 [00:02<00:00,  2.22s/it]
AWQ:   0%|          | 0/32 [00:06<?, ?it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/mit-han-lab___json/mit-han-lab--pile-val-backup-39bc257d0ce73de2/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
AWQ:   0%|          | 0/32 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/ml/code/run_clm_awq.py", line 339, in <module>
    main()
  File "/opt/ml/code/run_clm_awq.py", line 324, in main
    training_function(run, args)
  File "/opt/ml/code/run_clm_awq.py", line 292, in training_function
    model.quantize(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/awq/models/base.py", line 95, in quantize
    self.quantizer.quantize()
  File "/opt/conda/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 107, in quantize
    module_config: List[Dict] = self.awq_model.get_layers_for_scaling(
  File "/opt/conda/lib/python3.10/site-packages/awq/models/mixtral.py", line 46, in get_layers_for_scaling
    inp=input_feat['self_attn.q_proj'],
KeyError: 'self_attn.q_proj'

Could anyone please help me out with this?

RonanKMcGovern · 2024-01-17T11:20:38Z

If you merge into a quantized (transformers) model then it will become a 4- or 8-bit model, which you can't then run AWQ on.

Instead, you would need to reload the base model in 16-bit and merge your LoRA into that (using merge_and_unload()). Then you can run AWQ on that merged model. More info in this vid.
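The steps above can be sketched roughly as follows. This is a hedged outline rather than a tested recipe: the function name, its arguments, and the quantization settings are illustrative assumptions, and it presumes that transformers, peft, and AutoAWQ are installed and that the 16-bit base model fits in memory. The imports sit inside the function so the snippet can be read and loaded without those packages.

```python
# Typical 4-bit AWQ settings; treat these values as an assumption to adjust.
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def merge_then_quantize(base_id: str, adapter_id: str, merged_dir: str, awq_dir: str) -> None:
    """Merge a LoRA adapter into a 16-bit base model, then quantize the result with AWQ.

    All four arguments are hypothetical placeholders: model ids or local paths.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel
    from awq import AutoAWQForCausalLM

    # 1. Reload the base model in 16-bit, not in 4/8-bit.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(base_id)

    # 2. Attach the adapter and fold its weights into the base weights.
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(merged_dir)
    tokenizer.save_pretrained(merged_dir)

    # 3. Run AWQ on the merged 16-bit checkpoint and save the quantized model.
    model = AutoAWQForCausalLM.from_pretrained(merged_dir)
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(awq_dir)
    tokenizer.save_pretrained(awq_dir)
```

A call would look like `merge_then_quantize("base-model-id", "path/to/adapter", "merged-fp16", "merged-awq")`, with every argument replaced by real locations.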

Ph0rk0z · 2024-01-19T12:51:18Z

Having the base model becomes unmanageable with 70B+ models; that's part of the issue. They're 160 GB+.

RanchiZhao · 2024-04-15T12:30:01Z

Hi! Any progress? Is training LoRA modules with AWQ available now?

RicardoHalak · 2024-04-18T10:48:12Z

Hi, I'm also interested to know whether LoRA + AWQ is already available now. Thanks!

RanchiZhao · 2024-04-18T12:26:46Z

> Hi, I'm also interested to know whether LoRA + AWQ is already available now. Thanks!

@RicardoHalak see huggingface/transformers#28987; it is runnable.

casper-hansen added the enhancement (New feature or request) label on Nov 9, 2023

casper-hansen mentioned this issue Dec 3, 2023

Is it available fine tuning quantized model using peft library? #233

Closed
