Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I was trying to get Mixtral chat completion working with llama-cpp-python (both directly in Python and through the server), independently of the given prompt.
I defined two kinds of prompt:
SHORT_PROMPT
{
"messages": [
{"role": "system", "content": "This is a short prompt."},
{"role": "user", "content": "This is a short prompt."}
]
}
LONG_PROMPT
{
"messages": [
{"role": "system", "content": "This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. "},
{"role": "user", "content": "This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. "}
]
}
I would expect the model to return an answer regardless of which prompt I send.
Current Behavior
If I
- Start the server
- Send the SHORT_PROMPT (or any other short prompt) and wait for the result
- Send the LONG_PROMPT and wait for the result
everything is OK (= the model answers quickly).
If I
- Start the server
- Send the LONG_PROMPT and wait for the result
it hangs seemingly forever, with GPU utilization at 0% (see below).
If I send exponentially longer prompts each time, it sometimes hangs at ~1200 characters (JSON structure included), other times at ~3000.
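The doubling experiment can be sketched roughly as follows. This is a minimal illustration, not the exact script used: the helper names `build_payload` and `doubling_payloads` and the 4000-character cap are assumptions made for the example. It only builds the payloads and reports their JSON size, so each one can then be sent to the server in turn to see where the hang starts.

```python
import json

SENTENCE = "This is a long prompt. "


def build_payload(repeats: int) -> dict:
    """Build a chat payload whose system and user messages repeat SENTENCE."""
    text = SENTENCE * repeats
    return {
        "messages": [
            {"role": "system", "content": text},
            {"role": "user", "content": text},
        ]
    }


def doubling_payloads(max_chars: int):
    """Yield (json_size, payload) pairs, doubling the repeat count each step,
    until the serialized payload would exceed max_chars."""
    repeats = 1
    while True:
        payload = build_payload(repeats)
        size = len(json.dumps(payload))
        if size > max_chars:
            break
        yield size, payload
        repeats *= 2


if __name__ == "__main__":
    for size, _ in doubling_payloads(4000):
        print(f"payload of {size} characters (JSON structure included)")
```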
I tried:
- Waiting for about 20-30 minutes.
- Sending a request to the server directly or using the UI at localhost:8123/docs: same result.
- Instantiating Llama() from llama-cpp-python directly in Python, using the same parameters: same result.
- Using stream=True: nothing is sent back while it hangs.
- Changing the context size to 2000: it does not seem to change much.
When the process hangs:
top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2668867 user 20 0 111,8g 42,6g 41,7g S 100,3 33,9 0:45.34 python3
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 On | Off |
| 36% 66C P2 85W / 300W | 22124MiB / 49140MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:4B:00.0 Off | Off |
| 30% 57C P2 86W / 300W | 19902MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Environment and Context
- Physical (or virtual) hardware you are using, e.g. for Linux:
CPU: AMD Ryzen Threadripper 3960X 24-Core Processor
GPU: 2x NVIDIA RTX A6000, 49 GB each
Linux redacted 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Steps to Reproduce
- Installed llama-cpp-python with GPU (cuBLAS) support:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install 'llama-cpp-python[server]' --upgrade --force-reinstall --no-cache-dir
- Ran the server with:
python3 -m llama_cpp.server --n_gpu_layers 35 --model "<my_model_folder>/mixtral-8x7b-instruct-v0.1.Q6_K.gguf" --host 0.0.0.0 --port 8123 --n_ctx 30000
- See Current Behavior.
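For reproducing from a script instead of the /docs UI, a client along these lines may help. It is a sketch under assumptions: it targets the OpenAI-compatible /v1/chat/completions route that llama_cpp.server exposes, uses port 8123 from the command above, and the function names and the 60-second timeout are arbitrary choices for illustration. The timeout turns the indefinite hang into a visible error.

```python
import json
import socket
import urllib.error
import urllib.request

# Assumption: the server was started as in the steps above.
BASE_URL = "http://localhost:8123"

SENTENCE = "This is a long prompt. "
LONG_PROMPT = {
    "messages": [
        {"role": "system", "content": SENTENCE * 11},
        {"role": "user", "content": SENTENCE * 10},
    ]
}


def make_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def send(payload: dict, timeout: float = 60.0):
    """Return the parsed JSON response, or None if no reply arrives in time."""
    try:
        with urllib.request.urlopen(make_request(payload), timeout=timeout) as resp:
            return json.load(resp)
    except (socket.timeout, urllib.error.URLError) as exc:
        print(f"no response within {timeout}s, or connection failed: {exc}")
        return None


if __name__ == "__main__":
    # Send LONG_PROMPT first, right after starting the server.
    print(send(LONG_PROMPT))
```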
Failure Logs
No log is produced since the process hangs indefinitely.
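Since nothing is logged, one possible way to see where the Python side is stuck is the standard-library faulthandler module, which can dump all thread stacks after a timeout. The sketch below is an assumption about how it could be applied, not something that was run for this report. `run_with_watchdog` is a hypothetical helper, and the dump only shows Python frames, so a hang inside native llama.cpp code would show up as the last Python call made before entering it.

```python
import faulthandler
import sys
import time


def run_with_watchdog(fn, timeout: float, log_file=sys.stderr):
    """Call fn(); if it is still running after `timeout` seconds, dump the
    Python stack of every thread to log_file. fn() itself keeps running."""
    faulthandler.dump_traceback_later(timeout, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in for the hanging call, e.g. a chat completion request.
    run_with_watchdog(lambda: time.sleep(2), timeout=1)
```

When using the Llama() class directly, the wrapped function would be the chat completion call; for the server, the same two faulthandler calls could be placed around the request in a small launcher script.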
Example environment info:
llama-cpp-python$ python3 --version
Python 3.10.12
llama-cpp-python$ pip list | egrep "llama|uvicorn|fastapi|sse-starlette|numpy"
fastapi 0.105.0
llama_cpp_python 0.2.24
numpy 1.26.2
sse-starlette 1.8.2
uvicorn 0.24.0.post1