
Cannot merge LORA layers when the model is loaded in 8-bit mode #29

Open
yangjianxin1 opened this issue May 25, 2023 · 26 comments

Comments

@yangjianxin1

When I load the model as follows, it throws the error: Cannot merge LORA layers when the model is loaded in 8-bit mode.
How can I load the model in 4-bit for inference? (The code is reposted with proper formatting in the comment below.)

@yangjianxin1
Author

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

model_path = 'decapoda-research/llama-30b-hf'
adapter_path = 'timdettmers/guanaco-33b'
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map='auto'
)
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()  # raises: Cannot merge LORA layers when the model is loaded in 8-bit mode

@USBhost

USBhost commented May 25, 2023

Just FYI, decapoda-research is extremely out of date. Please use huggyllama instead.

@bodaay

bodaay commented May 25, 2023

Did you solve this? It's the same result with huggyllama.

@KKcorps
Contributor

KKcorps commented May 25, 2023

@bodaay what is the size of the adapter.bin you are getting? Mine is only a few bytes.

Btw, I just commented out the model = model.merge_and_unload() line and it works. Merging is not necessary.

@yangjianxin1
Author

remove "model = model.merge_and_unload()", and it works.

@bodaay

bodaay commented May 26, 2023

@bodaay what is the size of the adapter.bin you are getting? Mine is only a few bytes.

Btw, I just commented out the model = model.merge_and_unload() line and it works. Merging is not necessary.

When I re-save the model, it's the proper 3.2 GB.

@hemangjoshi37a

@yangjianxin1, Based on the code snippet you provided, it seems that you are loading the model using AutoModelForCausalLM with 4-bit quantization enabled. However, when attempting to merge the LORA layers using merge_and_unload(), the mentioned error occurs.

To address this issue, we recommend the following approach:

  1. Remove the line model = model.merge_and_unload() from your code. The merging step is not necessary for inference and can be omitted.

By removing the merge_and_unload() line, you should be able to successfully load and use the model for inference without encountering the "Cannot merge LORA layers when the model is loaded in 8-bit mode" error.
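
A minimal sketch of that no-merge approach (untested; the model and adapter IDs, prompt format, and generation settings are placeholders taken from the snippets above, not anything prescribed here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

model_path = 'huggyllama/llama-30b'        # base model (placeholder)
adapter_path = 'timdettmers/guanaco-33b'   # LoRA adapter

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map='auto'
)
model = PeftModel.from_pretrained(model, adapter_path)  # adapter stays attached, never merged
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("### Human: Hello! ### Assistant:", return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))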

@larawehbe


How can I obtain a single model that I can use in llama.cpp if I can't merge them?

Do you have any idea how to make the fine-tuned model usable inside llama.cpp?

Any help is highly appreciated.

@KiranNadig62

Removing merge_and_unload() is not the solution!
Of course the inference works fine; without that call there is simply no attempt to merge your LORA weights into the base model. What is the real solution here?

@samos123

samos123 commented Jul 6, 2023

You can see a workaround here: https://github.com/substratusai/model-falcon-7b-instruct/blob/430cf5dfda02c0359122d4ef7f9b6d0c01bb3b39/src/train.ipynb

Effectively, I reload the base model in 16-bit to work around the issue. It works fine for my use case.
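
Roughly, that workaround looks like this (a sketch, not the exact notebook code; model_path, adapter_path, and trained_model_path are placeholders):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in plain fp16 (no 4-bit/8-bit quantization), so merging is allowed.
base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True,   # needed for Falcon at the time
)
model = PeftModel.from_pretrained(base_model, adapter_path, torch_dtype=torch.float16)
model = model.merge_and_unload()           # succeeds, since the base is not quantized
model.save_pretrained(trained_model_path)  # plain checkpoint; PEFT is no longer needed at serving time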

@ashmitbhattarai

ashmitbhattarai commented Jul 11, 2023

Has anyone found a way to solve this without loading the model in 16-bit? My GPU cannot load the whole Falcon-40B in 16-bit, and using device_map="auto" with an offload_folder causes the Python process to be killed. Loading the base model in 4-bit and merging the LoRA adapters still fails with "Cannot merge LORA layers when the model is loaded in 8-bit mode".

@jocastrocUnal

I hope that in the future this code will just work; it's more natural.

model = peft_model.merge_and_unload()
model.save_pretrained("/model/trained")

@MrigankRaman

Any updates on this?

@larawehbe

You can see a workaround here: https://github.com/substratusai/model-falcon-7b-instruct/blob/430cf5dfda02c0359122d4ef7f9b6d0c01bb3b39/src/train.ipynb

Effectively I reload the base model in 16 bit to work around the issue. It works fine for my use case.

The link is broken

@xpang-sf

xpang-sf commented Aug 3, 2023

You can see a workaround here: https://github.com/substratusai/model-falcon-7b-instruct/blob/430cf5dfda02c0359122d4ef7f9b6d0c01bb3b39/src/train.ipynb

Effectively I reload the base model in 16 bit to work around the issue. It works fine for my use case.

Same here, the link is broken. Can you please re-share it?


@xpang-sf

xpang-sf commented Aug 4, 2023

New link: https://github.com/substratusai/images/blob/main/model-trainer-huggingface/src/train.ipynb

May I ask another basic question: this is training and model-saving code. After the model is saved, did you test loading it, running inference, and checking whether the generated results are good? If so, do you have a separate inference/generation script? Thanks a lot in advance.

@samos123

samos123 commented Aug 4, 2023

Yes, in substratus.ai we separate model loading, finetuning, and serving into separate images. I did check whether the finetuned model produced different results, and it did.

In the notebook that I linked, the following paths are used:

# original base model e.g. falcon-7b
model_path = "/content/saved-model/"
# path of final finetuned model (merged model)
trained_model_path = "/content/model"

@xpang-sf

xpang-sf commented Aug 4, 2023


Thank you so much, I will try it on my side and let you know.

@AegeanYan

I am worried about whether this quick fix might harm the model's ability. Is there any other way to fix this problem?

@hamidahmadian

Has anyone found a way to solve this without loading the model in 16-bit? My GPU cannot load the whole Falcon-40B in 16-bit, and using device_map="auto" with an offload_folder causes the Python process to be killed. Loading the base model in 4-bit and merging the LoRA adapters still fails with "Cannot merge LORA layers when the model is loaded in 8-bit mode".

I found a way that works in my case; I hope it helps. The problem of working with Llama2 in terms of training time and inference time, when we have just one GPU without much memory, can be split into three parts, which I will go through one by one.

  • Training: It is not possible to perform pure 8-bit or 4-bit training, but by leveraging parameter-efficient fine-tuning (PEFT) methods and training, for example, adapters on top of the quantized weights, we can fine-tune the Llama2 model. prepare_model_for_kbit_training plays an important role in preparing the model for training (see the sketch after this list).
  • Merging: on the other hand, for merging we can merge our LoRA layers into Llama2 even on the CPU. Here is a sample template:
import torch
from transformers import AutoModel
from peft import PeftModel

llama_model = AutoModel.from_pretrained(model_name,
                        device_map={"": "cpu"},
                        torch_dtype=torch.float16)

model = PeftModel.from_pretrained(llama_model, 
                        peft_model_path, 
                        torch_dtype=torch.float16, 
                        device_map={"": "cpu"})

model = model.merge_and_unload()

model.save_pretrained("merged-model")
  • Inference time: now you just need to load the model from the merged-model path, and you can load it in 8-bit or however you need. Here is a sample template:
model_for_infer = AutoModel.from_pretrained("merged-model",
                        device_map="auto",
                        load_in_8bit=True)
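
And here is a sketch of the training-side setup referred to in the first bullet (the LoRA hyperparameters and target_modules are assumed values, typical for Llama-style models, not anything prescribed in this thread):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type='nf4'
)
base = AutoModelForCausalLM.from_pretrained(model_name,
                        quantization_config=bnb_config,
                        device_map='auto')
base = prepare_model_for_kbit_training(base)  # freezes base weights, casts layer norms to fp32, etc.

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],      # assumed; pick the attention projections of your base model
    task_type='CAUSAL_LM'
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# ...train as usual, then save only the adapter:
# model.save_pretrained(peft_model_path)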

@samuelhkahn

While not the most elegant solution, @hamidahmadian's solution works for me.

@samos123

samos123 commented Oct 19, 2023

I have an issue with this approach when I add a special token. Has anyone figured out a way to do that? Code I'm using:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
base_model.resize_token_embeddings(len(tokenizer))  # tokenizer already extended with the special token
model = PeftModel.from_pretrained(base_model, trained_model_path_lora, torch_dtype=torch.float16)
model = model.merge_and_unload()

error observed:

>> model = PeftModel.from_pretrained(base_model, trained_model_path_lora, torch_dtype=torch.float16)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1158, in Module.to.<locals>.convert(t)
   1155 if convert_to_format is not None and t.dim() in (4, 5):
   1156     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1157                 non_blocking, memory_format=convert_to_format)
-> 1158 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

NotImplementedError: Cannot copy out of meta tensor; no data!

@eternitybt

eternitybt commented Nov 15, 2023

On my laptop with 16GB RAM + 16GB VRAM, @hamidahmadian's solution allows me to load the models, but still gives me an OOM error when doing merge_and_unload(). However, the following works for me:

llama_model = AutoModelForCausalLM.from_pretrained(model_name, device_map={"": "cpu"}, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(llama_model, peft_model_path)
model = model.merge_and_unload()

@praharshbhatt

praharshbhatt commented Jun 21, 2024

This should be the correct way to fix the issue:

import torch
from transformers import AutoModelForCausalLM

peft_model = model

# When you execute the commonly used `model = model.merge_and_unload()`, the error `Cannot merge LORA layers when the model is loaded in 8-bit mode` occurs. The reason is that the base model was loaded as 4 bit. Therefore, the base model must be reloaded as 16 bit.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    load_in_8bit=False,
    device_map="auto",
    trust_remote_code=True,
)
from peft import PeftModel

peft_model = PeftModel.from_pretrained(model, NEW_MODEL_PATH)
merged_model = peft_model.merge_and_unload()
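
If the goal is a standalone checkpoint (e.g. to reload it later in 8-bit, or to hand off for conversion), you would typically save the merged model and tokenizer afterwards; a small sketch, where MERGED_MODEL_PATH is a hypothetical output directory:

from transformers import AutoModelForCausalLM, AutoTokenizer

merged_model.save_pretrained(MERGED_MODEL_PATH)     # MERGED_MODEL_PATH is a placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(MERGED_MODEL_PATH)

# The merged directory is now an ordinary transformers checkpoint, so quantized loading works again:
reloaded = AutoModelForCausalLM.from_pretrained(
    MERGED_MODEL_PATH, load_in_8bit=True, device_map="auto")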
