What happened?
When trying to run FatLlama-1.7T-Instruct, llama.cpp fails while loading the model with the error n > N_MAX: 525 > 512 for key llama.feed_forward_length. This happens because the model has 525 layers, but LLAMA_MAX_LAYERS is hardcoded to 512.
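For context, the metadata loader stores per-layer hyperparameters in fixed-size arrays and rejects any model whose layer count exceeds the compile-time cap. A minimal sketch of that kind of check (names and structure are illustrative, not the exact llama.cpp code):

```cpp
// Illustrative sketch (simplified): per-layer hyperparameters live in
// fixed-size arrays of LLAMA_MAX_LAYERS entries, and the loader refuses
// to fill more slots than the array can hold.
#include <array>
#include <cstdint>
#include <stdexcept>
#include <string>

constexpr size_t LLAMA_MAX_LAYERS = 512; // compile-time cap

std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr; // one value per layer

void set_per_layer(const std::string & key, uint32_t value, uint32_t n_layer) {
    if (n_layer > n_ff_arr.size()) { // n_ff_arr.size() is N_MAX in the error
        throw std::runtime_error("n > N_MAX: " + std::to_string(n_layer) +
            " > " + std::to_string(n_ff_arr.size()) + " for key " + key);
    }
    for (uint32_t i = 0; i < n_layer; ++i) {
        n_ff_arr[i] = value; // a scalar key is broadcast to all layers
    }
}
```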
Running FatLlama 1.7T therefore requires bumping LLAMA_MAX_LAYERS to 525 or larger and recompiling llama.cpp. While this is relatively easy when using llama.cpp directly, as soon as third-party software built on backend-specific pre-built llama-cpp-python bindings is involved (like oobabooga/text-generation-webui), changing LLAMA_MAX_LAYERS becomes infeasible for the general user.
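For reference, the manual workaround is a one-line edit followed by a rebuild. A sketch, assuming the constant is still a plain #define in src/llama.cpp as in recent builds:

```cpp
// src/llama.cpp (exact location may differ between versions)
// was: #define LLAMA_MAX_LAYERS  512
#define LLAMA_MAX_LAYERS  1024  // raised so 525-layer models can load
```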
To fix this issue, LLAMA_MAX_LAYERS could either be increased or made dynamic.
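The dynamic variant could, for example, size the per-layer arrays from llama.block_count at load time instead of a compile-time constant. A minimal sketch of that idea (hypothetical struct, not the project's actual plan):

```cpp
// Hypothetical sketch: per-layer hyperparameters sized from the model's
// block count at load time instead of a compile-time LLAMA_MAX_LAYERS.
#include <cstdint>
#include <vector>

struct hparams_dynamic {
    uint32_t n_layer = 0;

    // one entry per layer; sized once llama.block_count is known
    std::vector<uint32_t> n_head_arr;
    std::vector<uint32_t> n_head_kv_arr;
    std::vector<uint32_t> n_ff_arr;

    void init(uint32_t block_count) {
        n_layer = block_count;
        n_head_arr.assign(n_layer, 0);
        n_head_kv_arr.assign(n_layer, 0);
        n_ff_arr.assign(n_layer, 0);
    }
};
```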
Name and Version
version: 3927 (10433e8)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
root@AI:/bpool# ./llama.cpp/llama-cli -m FATLLAMA-1.7T-Instruct.SOURCE.gguf -p "I believe the meaning of life is" -n 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 3927 (10433e8b) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 24 key-value pairs and 4729 tensors from FATLLAMA-1.7T-Instruct.SOURCE.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Merged
llama_model_loader: - kv 3: general.size_label str = 1.7T
llama_model_loader: - kv 4: llama.block_count u32 = 525
llama_model_loader: - kv 5: llama.context_length u32 = 131072
llama_model_loader: - kv 6: llama.embedding_length u32 = 16384
llama_model_loader: - kv 7: llama.feed_forward_length u32 = 53248
llama_model_loader: - kv 8: llama.attention.head_count u32 = 128
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 16
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: llama.vocab_size u32 = 128256
llama_model_loader: - kv 14: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", ...
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 22: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 1052 tensors
llama_model_loader: - type f16: 3677 tensors
llama_model_load: error loading model: error loading model hyperparameters: n > N_MAX: 525 > 512 for key llama.feed_forward_length
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model 'FATLLAMA-1.7T-Instruct.SOURCE.gguf'
main: error: unable to load model