Mixtral hanging if the first prompt is "too long" #1033

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the [README.md](https://github.com/abetlen/llama-cpp-python/blob/main/README.md).
- [x] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/abetlen/llama-cpp-python/discussions), and have a new bug or useful enhancement to share.

# Expected Behavior

I am trying to get Mixtral chat completion working with llama-cpp-python (both directly in Python and through the server), regardless of the prompt given.


I defined two kinds of prompt:

SHORT_PROMPT 
```json
{
  "messages": [
{"role": "system", "content": "This is a short prompt."}, 
{"role": "user", "content": "This is a short prompt."}
]
}
```

LONG_PROMPT
```json
{
  "messages": [
{"role": "system", "content": "This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. "}, 
{"role": "user", "content": "This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. This is a long prompt. "}
]
}
```

I would expect the model to answer regardless of which prompt I send.
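For anyone reproducing this from Python, here is a minimal sketch that builds the two request bodies as dicts. The helper name `make_payload` is hypothetical, and the repeat counts only approximate the bodies above.

```python
import json


def make_payload(system_text, user_text):
    """Build a chat-completion request body with one system and one user message."""
    return {
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ]
    }


# Repeat counts are approximate; they are not an exact copy of the bodies above.
SHORT_PROMPT = make_payload("This is a short prompt.", "This is a short prompt.")
LONG_PROMPT = make_payload("This is a long prompt. " * 11, "This is a long prompt. " * 10)

# Size of each body in characters, JSON structure included.
SHORT_SIZE = len(json.dumps(SHORT_PROMPT))
LONG_SIZE = len(json.dumps(LONG_PROMPT))
```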

# Current Behavior

If I
1) Start the server
2) Send the `SHORT_PROMPT` (or any other short prompt) and wait for the result
3) Send the `LONG_PROMPT` and wait for the result

everything is OK (the model answers quickly).

If I
1) Start the server
2) Send the `LONG_PROMPT` and wait for the result

it hangs seemingly forever, with GPU utilization near 0% (see below).
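The two sequences can be scripted with a client-side timeout so that a hang shows up as a `None` result instead of blocking forever. This is only a sketch: it assumes the server's OpenAI-compatible `/v1/chat/completions` route on port 8123, and the helper names `build_request` and `send_chat` are hypothetical.

```python
import json
import socket
import urllib.error
import urllib.request

# Assumption: server started as in "Steps to Reproduce", reachable on localhost.
SERVER_URL = "http://localhost:8123/v1/chat/completions"


def build_request(payload, url=SERVER_URL):
    """Wrap a chat payload (a dict) in a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(payload, url=SERVER_URL, timeout_s=120.0):
    """Send the payload; return the parsed reply, or None if the call timed out."""
    try:
        with urllib.request.urlopen(build_request(payload, url), timeout=timeout_s) as response:
            return json.loads(response.read().decode("utf-8"))
    except (socket.timeout, TimeoutError):
        # Timeout while waiting for or reading the response.
        return None
    except urllib.error.URLError as exc:
        # Timeouts during connection setup arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return None
        raise
```

With the `LONG_PROMPT` body loaded as a dict, calling `send_chat(LONG_PROMPT)` right after a fresh server start should return `None` if the hang reproduces, while calling `send_chat(SHORT_PROMPT)` first and `send_chat(LONG_PROMPT)` second should return two replies.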

If I send exponentially longer prompts each time, it sometimes hangs at ~1200 characters (JSON structure included) and other times at ~3000.
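The growing-prompt experiment could be generated roughly like this. It is a sketch, not the exact script I used: the generator name `growing_payloads`, the starting length, and the 4000-character cap are assumptions.

```python
import json


def growing_payloads(start_repeats=1, max_chars=4000, sentence="This is a long prompt. "):
    """Yield (size_in_chars, payload) pairs, doubling the repeated text each step.

    The size is the length of the JSON-encoded body, structure included.
    """
    repeats = start_repeats
    while True:
        text = sentence * repeats
        payload = {
            "messages": [
                {"role": "system", "content": text},
                {"role": "user", "content": text},
            ]
        }
        size = len(json.dumps(payload))
        if size > max_chars:
            return
        yield size, payload
        repeats *= 2
```

Each payload should be sent to a freshly started server, since a short first request appears to mask the problem.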

I tried:
- Waiting for about 20-30 minutes.
- Sending a request to the server directly or using the UI at `localhost:8123/docs`: same result.
- Instantiating `Llama()` from llama-cpp-python directly in Python with the same parameters: same result.
- Using `stream=True`: nothing is sent back while it hangs.
- Changing the context size to `2000`: this does not seem to change much.
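The direct-Python attempt looks roughly like the sketch below. The parameter values mirror the server flags from "Steps to Reproduce"; the helper names `llama_kwargs` and `first_streamed_chunk` are hypothetical, and the model path is left to the caller.

```python
def llama_kwargs(model_path):
    """Constructor arguments mirroring the server flags used in this report."""
    return {"model_path": model_path, "n_gpu_layers": 35, "n_ctx": 30000}


def first_streamed_chunk(model_path, messages):
    """Load the model and return the first streamed chunk, or None if the stream is empty."""
    # Imported lazily so llama_kwargs() can be used without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(**llama_kwargs(model_path))
    stream = llm.create_chat_completion(messages=messages, stream=True)
    return next(iter(stream), None)
```

When the hang reproduces, `first_streamed_chunk` never returns if the long prompt is the first request the model sees.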

When the process hangs:
```
top:
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
2668867 user      20   0  111,8g  42,6g  41,7g S 100,3  33,9   0:45.34 python3  
```

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:01:00.0  On |                  Off |
| 36%   66C    P2    85W / 300W |  22124MiB / 49140MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:4B:00.0 Off |                  Off |
| 30%   57C    P2    86W / 300W |  19902MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```



# Environment and Context

* Physical (or virtual) hardware you are using, e.g. for Linux:

`AMD Ryzen Threadripper 3960X 24-Core Processor`
`GPU: 2x A6000 49Gb`
`Linux redacted 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux`

# Steps to Reproduce

- Installed llama-cpp-python with GPU (cuBLAS) support: `CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install 'llama-cpp-python[server]' --upgrade --force-reinstall --no-cache-dir`
- Started the server with `python3 -m llama_cpp.server --n_gpu_layers 35 --model "<my_model_folder>/mixtral-8x7b-instruct-v0.1.Q6_K.gguf" --host 0.0.0.0 --port 8123 --n_ctx 30000`
- See Current Behavior.

# Failure Logs

No log is produced since the process hangs indefinitely.
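Since nothing is logged, one way to get at least a Python-level stack during the hang would be the standard library's `faulthandler`, wrapped around the direct `Llama()` call. This is a sketch with a hypothetical helper name; if the process is stuck inside native llama.cpp code, the dump will only show the Python frame that called into it.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def traceback_if_stuck(timeout_s, file=sys.stderr):
    """Dump all thread tracebacks to `file` if the block runs longer than timeout_s seconds."""
    faulthandler.dump_traceback_later(timeout_s, file=file)
    try:
        yield
    finally:
        # The block finished (or raised) in time, so cancel the pending dump.
        faulthandler.cancel_dump_traceback_later()
```

Usage would be `with traceback_if_stuck(300): ...` around the chat-completion call, with the timeout chosen well above a normal response time.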

Example environment info:
```

llama-cpp-python$ python3 --version
Python 3.10.12

llama-cpp-python$ pip list | egrep "llama|uvicorn|fastapi|sse-starlette|numpy"
fastapi                   0.105.0
llama_cpp_python          0.2.24
numpy                     1.26.2
sse-starlette             1.8.2
uvicorn                   0.24.0.post1
```