
Bug: llama.cpp server arg LLAMA_ARG_N_GPU_LAYERS doesn't follow the same convention as llama-cpp-python n_gpu_layers #9556

@mvonpohle

Description

What happened?

If you create a Llama model in Python code, you can specify n_gpu_layers=-1 so that all layers are offloaded to the GPU (see the example below). When starting the llama.cpp server using the Docker image, setting LLAMA_ARG_N_GPU_LAYERS: -1 doesn't have the same effect.

from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU
Llama('path/to/model', chat_format="llama-3", n_ctx=1024, n_gpu_layers=-1, verbose=False)
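
The Python call works because llama-cpp-python normalizes the -1 sentinel itself before calling into the C API, so llama.cpp never sees a literal -1. A minimal sketch of that convention, assuming the 0x7FFFFFFF substitution found in llama-cpp-python's source (verify against your installed version):

def normalize_n_gpu_layers(n_gpu_layers: int) -> int:
    # -1 means "offload everything": substitute INT32_MAX so any
    # model's layer count compares as fully offloaded
    return 0x7FFFFFFF if n_gpu_layers == -1 else n_gpu_layers

By contrast, the docker-compose service below passes -1 straight through the environment variable, and the log output further down shows it being treated as 0 layers: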
llamacpp-server:
  image: ghcr.io/ggerganov/llama.cpp:server-cuda@sha256:fe887bd3debd1a55ddd95f067435a38166f15a058bf50fee173517b9831081c8
  ports:
    - 8080:8080
  volumes:
    # TODO: change
    - ./model:/model
  environment:
    # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
    LLAMA_ARG_MODEL: /model/path-to-model.gguf
    LLAMA_ARG_N_GPU_LAYERS: -1
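
As a workaround, assuming the environment variable accepts plain positive integers the same way the -ngl CLI flag does, passing a value at least as large as the model's layer count (the log below reports 33) should offload the full model:

  environment:
    LLAMA_ARG_MODEL: /model/path-to-model.gguf
    # assumed workaround: any value >= the model's 33 layers offloads
    # everything; a large number like 999 is a common "all layers" convention
    LLAMA_ARG_N_GPU_LAYERS: 999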

Name and Version

From the prebuilt docker image ghcr.io/ggerganov/llama.cpp:server-cuda@sha256:fe887bd3debd1a55ddd95f067435a38166f15a058bf50fee173517b9831081c8

version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

llamacpp-server-1  | ggml_cuda_init: found 1 CUDA devices:
llamacpp-server-1  |   Device 0: Tesla T4, compute capability 7.5, VMM: yes
llamacpp-server-1  | llm_load_tensors: ggml ctx size =    0.14 MiB
llamacpp-server-1  | llm_load_tensors: offloading 0 repeating layers to GPU
llamacpp-server-1  | llm_load_tensors: offloaded 0/33 layers to GPU
llamacpp-server-1  | llm_load_tensors:        CPU buffer size =  6282.97 MiB

Metadata


Labels: bug-unconfirmed, low severity, stale
