server: fix the disappearance of the end of the text when streaming with stop strings by z80maniac · Pull Request #9867 · ggml-org/llama.cpp

z80maniac · 2024-10-12T15:50:22Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

The problem: the end of the text may not be pushed to the API-client when streaming with stop strings enabled.

For example, let's test the following prompt:

Alice: Ask me any question.
Bob: What color is the sky on Mars

The stop string will be \n (stopping at the end of a dialog line). The next predicted character should be the question mark (it actually depends on the model and the probabilities, but in this case, let's assume that the seed is chosen so that the next character is ?).

The bug depends heavily on the model's tokenizer. In my case the bug can be seen on Llama-3.2-3B-Instruct-Q8_0.gguf model.

First, let's try it without streaming:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on Mars", "n_predict": 8, "cache_prompt": true, "stop": ["\n"], "seed": 42}' http://127.0.0.1:8080/completion | jq .content

It gives the correct result:

"?"

Now let's try it with streaming:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on Mars", "n_predict": 8, "cache_prompt": true, "stop": ["\n"], "seed": 42, "stream": true}' http://127.0.0.1:8080/completion

data: {"content":"","stop":false,"id_slot":0,"multimodal":false,"index":0}

data: {"content":"","id_slot":0,"stop":true,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":17,"generation_settings":{"n_ctx":8192,"n_predict":-1,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","seed":42,"seed_cur":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"Alice: Ask me any question.\nBob: What color is the sky on Mars","truncated":false,"stopped_eos":false,"stopped_word":true,"stopped_limit":false,"stopping_word":"\n","tokens_cached":17,"timings":{"prompt_n":1,"prompt_ms":30.498,"prompt_per_token_ms":30.498,"prompt_per_second":32.7890353465801,"predicted_n":1,"predicted_ms":0.028,"predicted_per_token_ms":0.028,"predicted_per_second":35714.28571428571},"index":0}

The question mark disappeared. And it's also not featured in the stopping_word. So it's completely lost and the API-client won't be able to restore it.

It happens because the next token returned by the model contains both a question mark and the stop string: ?\n\n. And the current code skips sending the current token completely if it contains the stop string.

The change in this PR sends the remainder of the token to the API-client in this case. When is_stop_full=true it's safe to send the response, because the stop string and everything after it will already be truncated from the generated_text at this point.

The response with this PR applied:

data: {"content":"?","stop":false,"id_slot":0,"multimodal":false,"index":0}

data: {"content":"","id_slot":0,"stop":true,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":17,"generation_settings":{"n_ctx":8192,"n_predict":-1,"model":"/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf","seed":42,"seed_cur":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"Alice: Ask me any question.\nBob: What color is the sky on Mars","truncated":false,"stopped_eos":false,"stopped_word":true,"stopped_limit":false,"stopping_word":"\n","tokens_cached":17,"timings":{"prompt_n":1,"prompt_ms":17.814,"prompt_per_token_ms":17.814,"prompt_per_second":56.13562366677894,"predicted_n":1,"predicted_ms":0.055,"predicted_per_token_ms":0.055,"predicted_per_second":18181.81818181818},"index":0}

…ith stop strings

ggerganov · 2024-10-14T07:17:34Z


            // check if there is any token to predict
-            if (stop_pos == std::string::npos || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {
+            if (stop_pos == std::string::npos || is_stop_full || (!slot.has_next_token && !is_stop_full && stop_pos > 0)) {


I think this if will always evaluate to true:

if stop_pos == std::string::npos -> true

if stop_pos != std::string::npos, then is_stop_full == true due to line 1075 -> true

I have attempted to simplify this logic in the new commit. The text will not be sent only when the partial match is found. And partial match will be searched for only when the current token is not the last one.

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

server: fix the disappearance of the end of the text when streaming w…

ae996a2

…ith stop strings

github-actions Bot added examples server labels Oct 12, 2024

ggerganov reviewed Oct 14, 2024

View reviewed changes

z80maniac added 2 commits October 14, 2024 21:53

simplify "send text" checks

40a68f4

Merge remote-tracking branch 'upstream/master' into fix-stop-trim

6e13d3f

ggerganov approved these changes Oct 16, 2024

View reviewed changes

ggerganov merged commit 1f66b69 into ggml-org:master Oct 16, 2024

ThiloteE mentioned this pull request Oct 17, 2024

Support GPT4All Server API JabRef/jabref#11870

Closed

someone13574 mentioned this pull request Oct 18, 2024

Last character being truncated by stop sequence ollama/ollama#7256

Closed

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024

server : fix the disappearance of the end of the text (ggml-org#9867)

aca8957

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

someone13574 mentioned this pull request Oct 18, 2024

Fix truncation of end of text when streaming with stop strings ollama/ollama#7263

Closed

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

server : fix the disappearance of the end of the text (ggml-org#9867)

c0c80ff

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

server : fix the disappearance of the end of the text (ggml-org#9867)

7206960

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

server : fix the disappearance of the end of the text (ggml-org#9867)

99640be

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

firecoperana mentioned this pull request Oct 27, 2025

Fix text completions not stop when using stop strings during streaming ikawrakow/ik_llama.cpp#870

Merged

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

server : fix the disappearance of the end of the text (ggml-org#9867)

b6cf319

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026

server : fix the disappearance of the end of the text (ggml-org#9867)

d7c290f

* server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: fix the disappearance of the end of the text when streaming with stop strings#9867

server: fix the disappearance of the end of the text when streaming with stop strings#9867
ggerganov merged 3 commits into
ggml-org:masterfrom
z80maniac:fix-stop-trim

z80maniac commented Oct 12, 2024

Uh oh!

ggerganov Oct 14, 2024

Uh oh!

z80maniac Oct 14, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

z80maniac commented Oct 12, 2024

Uh oh!

ggerganov Oct 14, 2024

Choose a reason for hiding this comment

Uh oh!

z80maniac Oct 14, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants