Mixtral hanging if the first prompt is "too long" #1033

@edoardogiacomello

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am trying to get Mixtral chat completion working with llama-cpp-python (both directly in Python and through the server), regardless of the prompt.

I defined two kinds of prompt:

SHORT_PROMPT

{
  "messages": [
    {"role": "system", "content": "This is a short prompt."},
    {"role": "user", "content": "This is a short prompt."}
  ]
}

LONG_PROMPT

{
  "messages": [
    {"role": "system", "content": "This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. "},
    {"role": "user", "content": "This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. "}
  ]
}

I would expect the model to return an answer regardless of which prompt I send.

Current Behavior

If I

  1. Start the server
  2. Send the SHORT_PROMPT (or any other short prompt) and wait for result
  3. Send the LONG_PROMPT and wait for result
    everything is ok (= the model answers quickly)

If I

  1. Start the server
  2. Send the LONG_PROMPT and wait for result
    the server hangs, seemingly forever, with GPU utilization close to 0% (see below)
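
The two sequences above can be scripted so that a hang shows up as a client-side timeout instead of an open-ended wait. This is a minimal sketch, not the client actually used for this report: it assumes the server is listening on localhost:8123 (the port from this issue) and exposes the OpenAI-compatible /v1/chat/completions route. The helper names and the 120-second timeout are made up for illustration, and the generated prompts omit the trailing space present in the original LONG_PROMPT.

```python
import json
import socket
import urllib.error
import urllib.request

# Assumed endpoint: the port comes from this report, the route is the
# OpenAI-compatible chat endpoint served by llama_cpp.server.
URL = "http://localhost:8123/v1/chat/completions"


def build_payload(sentence: str, system_repeats: int, user_repeats: int) -> dict:
    """Build a chat payload whose messages repeat `sentence` the given number of times."""
    return {
        "messages": [
            {"role": "system", "content": " ".join([sentence] * system_repeats)},
            {"role": "user", "content": " ".join([sentence] * user_repeats)},
        ]
    }


SHORT_PROMPT = build_payload("This is a short prompt.", 1, 1)
LONG_PROMPT = build_payload("This is a long prompt.", 11, 10)


def post_chat(payload: dict, timeout_s: float = 120.0) -> str:
    """POST the payload; return the reply text, or "TIMEOUT" if no answer arrives in time."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            body = json.load(response)
    except (socket.timeout, TimeoutError):
        # Timeout while waiting for the response.
        return "TIMEOUT"
    except urllib.error.URLError as error:
        # Timeout while connecting is wrapped in URLError.
        if isinstance(error.reason, (socket.timeout, TimeoutError)):
            return "TIMEOUT"
        raise
    return body["choices"][0]["message"]["content"]


def reproduce_long_first() -> str:
    """Failing order from this report: LONG_PROMPT as the first request after a restart."""
    return post_chat(LONG_PROMPT)
```

Calling reproduce_long_first() right after starting the server corresponds to the failing sequence; calling post_chat(SHORT_PROMPT) first and then post_chat(LONG_PROMPT) corresponds to the working one.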

If I send exponentially longer prompts each time, it sometimes hangs at ~1200 characters (JSON structure included) and other times at ~3000.
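
A sketch of how such a sweep could be generated, so the threshold can be narrowed down reproducibly. The function name, the doubling factor, and the 4000-character cap are assumptions for illustration; the size is measured on the serialized JSON, matching the "JSON structure included" counting above.

```python
import json


def growing_payloads(sentence="This is a long prompt.", start_repeats=1, max_chars=4000):
    """Yield (size_in_chars, payload) pairs, doubling the repeat count each step.

    Stops before the serialized payload would exceed max_chars.
    """
    repeats = start_repeats
    while True:
        content = " ".join([sentence] * repeats)
        payload = {
            "messages": [
                {"role": "system", "content": content},
                {"role": "user", "content": content},
            ]
        }
        size = len(json.dumps(payload))
        if size > max_chars:
            return
        yield size, payload
        repeats *= 2


# Print the payload sizes that would be sent, smallest first.
for size, _payload in growing_payloads():
    print(size)
```

Each payload would be sent as the first request to a freshly restarted server, since the hang only appears when the long prompt comes first.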

I tried:

  • Waiting for about 20 to 30 minutes.
  • Sending a request to the server directly or using the UI at localhost:8123/docs: same result.
  • Instantiating Llama() from llama-cpp-python directly in Python with the same parameters: same result.
  • Using stream=True: nothing is sent back while it hangs.
  • Changing the context size to 2000: this does not seem to change much.
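
For reference, the direct-in-Python attempt can be sketched roughly as follows. This is an approximation rather than the exact script used: the constructor arguments mirror the server flags from this report, the function names are hypothetical, and the model path is left as a parameter.

```python
def llama_kwargs(model_path: str) -> dict:
    """Constructor arguments mirroring the server flags used in this report."""
    return {
        "model_path": model_path,
        "n_gpu_layers": 35,  # from --n_gpu_layers 35
        "n_ctx": 30000,      # from --n_ctx 30000
    }


def first_request_long(model_path: str) -> str:
    """Load the model and send a long prompt as the very first request."""
    # Imported lazily so llama_kwargs() can be used without the package installed.
    from llama_cpp import Llama

    llm = Llama(**llama_kwargs(model_path))
    sentence = "This is a long prompt. "
    result = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": sentence * 11},
            {"role": "user", "content": sentence * 10},
        ]
    )
    return result["choices"][0]["message"]["content"]
```

Calling first_request_long() with the path to the GGUF file takes the server and HTTP layer out of the picture, which matches the observation that the hang is not specific to the server.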

When the process hangs:

top:
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
2668867 user      20   0  111,8g  42,6g  41,7g S 100,3  33,9   0:45.34 python3  

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:01:00.0  On |                  Off |
| 36%   66C    P2    85W / 300W |  22124MiB / 49140MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:4B:00.0 Off |                  Off |
| 30%   57C    P2    86W / 300W |  19902MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
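
To record GPU utilization over the duration of a hang rather than from a single snapshot, the same numbers can be polled in a machine-readable form. This is a sketch that assumes nvidia-smi is on the PATH; the function names are hypothetical.

```python
import subprocess

# CSV query for GPU index, utilization (%), and used memory (MiB), without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse lines such as '0, 4, 22124' into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"gpu": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def sample_gpus() -> list[dict]:
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling sample_gpus() in a loop with a short sleep while the request is pending would show whether utilization stays flat for the whole hang.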

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:

AMD Ryzen Threadripper 3960X 24-Core Processor
GPU: 2x NVIDIA RTX A6000 (49140 MiB each, per nvidia-smi)
Linux redacted 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Steps to Reproduce

  • Installed llama-cpp-python with GPU (cuBLAS) support using CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install 'llama-cpp-python[server]' --upgrade --force-reinstall --no-cache-dir
  • Ran the server with python3 -m llama_cpp.server --n_gpu_layers 35 --model "<my_model_folder>/mixtral-8x7b-instruct-v0.1.Q6_K.gguf" --host 0.0.0.0 --port 8123 --n_ctx 30000
  • See Current Behavior.

Failure Logs

No log is produced since the process hangs indefinitely.
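
Since the hang produces no output, one way to get at least a Python-level stack trace is the standard-library faulthandler module. The sketch below is a suggestion, not something that was run for this report; the context manager name is hypothetical. It only shows Python frames, so a hang inside native code would appear as the Python line that called into it.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s: float, file=None):
    """Dump all Python thread tracebacks if the wrapped block runs longer than timeout_s.

    The dump repeats every timeout_s seconds until the block finishes.
    Output goes to `file` (default: stderr), which must have a real file descriptor.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog once the block returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the first chat completion call in `with dump_stacks_if_stuck(60):` would print where the Python side is waiting once a minute while the process is stuck.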

Example environment info:


llama-cpp-python$ python3 --version
Python 3.10.12

llama-cpp-python$ pip list | egrep "llama|uvicorn|fastapi|sse-starlette|numpy"
fastapi                   0.105.0
llama_cpp_python          0.2.24
numpy                     1.26.2
sse-starlette             1.8.2
uvicorn                   0.24.0.post1

Metadata

Labels: bug (Something isn't working)
