Feature Request: Add support for Phi-4 model #10814

@fairydreaming

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Microsoft has released a new Phi-4 14B model. So far it's available only on Azure AI Foundry; in a few days it will appear on HuggingFace.

Motivation

The model is advertised as having strong reasoning abilities despite its relatively small size. It would be great to have it supported in llama.cpp.

Possible Implementation

The model uses the Phi3ForCausalLM architecture, which is already supported in llama.cpp. The differences I noticed that cause problems are:

  1. It uses the GPT2Tokenizer tokenizer_class, not LlamaTokenizer like the previous Phi models. The convert_hf_to_gguf.py script expects Phi3ForCausalLM-based models to have a SentencePiece tokenizer.model file and throws an exception if it's not present, so it has to be modified to support Phi-4.
  2. The model has the sliding_window parameter set to null in config.json. The Phi-4 Technical Report says:

The phi-4 model is based on a decoder-only transformer architecture with 14B parameters and a default context length of 4096. This is later extended to a 16K context length during midtraining. The architecture closely follows phi-3-medium, except that we now use the tiktoken tokenizer (for better multilingual support) with a padded vocabulary size of 100,352 (including unused tokens) and we use full attention over the 4K context length, rather than a 2K sliding window used in phi-3-medium

My initial solution for the 1st problem was:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index c63d929c..1ae37b83 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -2129,6 +2129,9 @@ class Phi3MiniModel(Model):
     model_arch = gguf.MODEL_ARCH.PHI3
 
     def set_vocab(self):
+        if self.metadata.name == "Phi 4":
+            return self._set_vocab_gpt2()
+
         from sentencepiece import SentencePieceProcessor
 
         tokenizer_path = self.dir_model / 'tokenizer.model'
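
Keying on the model name works, but it is a bit fragile. A possibly more robust check (just a sketch, not a tested patch) would be to look at tokenizer_class in tokenizer_config.json and take the BPE path whenever it is GPT2Tokenizer; the _uses_gpt2_tokenizer helper below is made up for illustration:

    def _uses_gpt2_tokenizer(self) -> bool:
        # Hypothetical helper: True when the HF checkpoint declares a GPT2Tokenizer
        # (BPE) instead of the SentencePiece tokenizer of earlier Phi-3 releases.
        import json

        config_path = self.dir_model / 'tokenizer_config.json'
        if not config_path.is_file():
            return False
        with open(config_path, encoding='utf-8') as f:
            tokenizer_config = json.load(f)
        return tokenizer_config.get('tokenizer_class') == 'GPT2Tokenizer'

    def set_vocab(self):
        if self._uses_gpt2_tokenizer():
            return self._set_vocab_gpt2()
        # ... existing SentencePiece path unchanged ...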

As for the second problem, I manually changed the sliding_window parameter value to the max context length (16384) in config.json before conversion. This allowed me to test the model. I suppose the final implementation should detect the presence of a Phi-4 model and build a full KQ mask instead of a sliding-window KQ mask.
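
For reference, the conversion-side part of that could look roughly like the sketch below (assuming Phi3MiniModel writes the window size through gguf_writer.add_sliding_window in set_gguf_parameters, as other sliding-window models do; the final fix may instead belong in llama.cpp's KQ-mask construction):

        # Sketch: treat a null sliding_window as "full attention" by falling back
        # to the trained context length, so the manual config.json edit above is
        # no longer needed.
        sliding_window = self.hparams.get('sliding_window')
        if sliding_window is None:
            # Phi-4 sets sliding_window to null; use max_position_embeddings (16384)
            sliding_window = self.hparams.get('max_position_embeddings', 16384)
        self.gguf_writer.add_sliding_window(sliding_window)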
