What happened?
The `lm_head` layer of a Gemma2 LoRA adapter is not converted by `convert_lora_to_gguf.py`, and is therefore not applied at inference (ruining the performance of the adapter).
How to reproduce:
- LoRA fine-tune Gemma2 with pytorch/peft, including `lm_head` in the `target_modules` param: `config = LoraConfig(target_modules=["lm_head"], ...)` (a minimal sketch of this setup follows after this list).
- Save the adapter.
- Convert the adapter with `python convert_lora_to_gguf.py <adapter folder> --base <base model folder> --outtype f32`. Debugging the conversion shows that the `lm_head` layer is skipped by this line in `convert_hf_to_gguf.py` (and no error is raised):
```python
if name == "lm_head.weight":
    logger.debug(f"Skipping get tensor {name!r} in safetensors so that convert can end normally.")
    return []
```
- Run `llama-cli` to check that indeed no LoRA layer is applied in the respective line in llama.cpp (the second sketch after this list inspects the converted adapter directly):
```sh
./llama-cli -m base/model/path/Base-F32.gguf \
    --lora lora/model/path/Lora-F32-LoRA.gguf \
    -p "Hello Gemma2" -n 50
```
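For reference, a minimal sketch of steps 1–2 with peft. The model id `google/gemma-2-2b`, the LoRA hyperparameters, and the output folder name are assumptions; only the presence of `"lm_head"` in `target_modules` matters for triggering the issue.
```python
# Minimal repro sketch for steps 1-2 (assumed model id and hyperparameters).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # assumed base model

# Including "lm_head" in target_modules is what produces the adapter
# tensors that convert_lora_to_gguf.py later drops silently.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# ... fine-tune the adapter here ...

model.save_pretrained("gemma2-lora-adapter")  # step 2: save the adapter
```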
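And a sketch for inspecting the converted adapter directly, using the `gguf` Python package that ships with llama.cpp. The adapter path is taken from the command above; the assumption that `lm_head` maps to the `output` tensor name in GGUF is my reading of the converter and may be off.
```python
# List tensor names in the converted adapter GGUF to confirm that no
# lm_head LoRA tensors survived conversion.
from gguf import GGUFReader  # gguf-py, bundled with llama.cpp

reader = GGUFReader("lora/model/path/Lora-F32-LoRA.gguf")
names = [t.name for t in reader.tensors]
print("\n".join(names))

# Assumption: lm_head is mapped to "output" in GGUF naming, so its LoRA
# tensors would show up as e.g. "output.weight.lora_a" / "output.weight.lora_b".
print("lm_head LoRA tensors present:", any(n.startswith("output.") for n in names))
```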
Expected behaviour
I think this is a bug because a user might have trained an adapter that is applied to the `lm_head` layer, so skipping it on conversion will destroy the adapter's performance. I think the code should either:
- raise an error saying `Cannot convert Gemma2 adapter with lm_head layer` (a rough sketch of this option follows below), or
- handle the `lm_head` layer (although this might be tricky when merging adapters, as the `lm_head` layer shares its weights with the `embed` layer in Gemma2, probably leading to having to create a new tensor for the `lm_head` to merge the adapter into).
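To make the first option concrete, here is a rough, hedged sketch of how the silent skip quoted above could instead fail loudly when converting an adapter. The `self.is_lora` flag is hypothetical; the real guard would depend on how `convert_lora_to_gguf.py` hooks into `convert_hf_to_gguf.py`.
```python
# Sketch only: turn the silent skip into an explicit error for LoRA conversion.
if name == "lm_head.weight":
    if self.is_lora:  # hypothetical flag: converting an adapter, not a base model
        raise ValueError(
            "Cannot convert Gemma2 adapter with lm_head layer: "
            "Gemma2 ties lm_head to the embedding weights."
        )
    logger.debug(f"Skipping get tensor {name!r} in safetensors so that convert can end normally.")
    return []
```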
Comments
- I think the script `convert_lora_to_gguf.py` was introduced in PR Refactor lora adapter support #8332, so maybe @ngxson knows whether skipping the `lm_head` is the desired outcome or whether it is actually a bug. Otherwise I'm happy to try to figure out why this happens.
- This is not the case for, say, Phi3, which converts the `lm_head` LoRA layer correctly.
- I can provide more code/models to reproduce the bug easily if that helps.
Name and Version
version: 3524 (bc0f887)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
What operating system are you seeing the problem on?
macOS, but it should be a platform-independent problem.
Relevant log output
No response