android: Fix infinite generation by updating stop_generation_position and implement attention sinks#18476
Conversation
…_context(): Properly maintain stop_generation_position when the context is shifted. Additionally, add an attention_sink variable to maintain attention on the initial tokens. Move the shift_context() function definition to after the variable stop_generation_position is defined. Fixes ggml-org#18409.
Pull request overview
This PR fixes a critical bug in the Android example's context shifting implementation that caused infinite text generation, and enhances the mechanism with attention sink support to maintain model stability during long conversations.
Key Changes
- Fixed infinite generation loop by updating `stop_generation_position` when the context is shifted
- Implemented attention sink mechanism to preserve initial tokens and prevent perplexity degradation
- Relocated the `shift_context()` function definition after the short-term state variables for proper scope access
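Taken together, the two fixes can be sketched as a pure function over the token history. This is a minimal illustration only; the names, signature, and exact arithmetic are assumptions, not the PR's actual `ai_chat.cpp` code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch of context shifting with an attention sink:
// keep the first `attention_sink` tokens (and the system prompt),
// discard the older half of the tokens appended after that point,
// and shift stop_generation_position left by the discarded count so
// the generation loop's termination check stays consistent.
struct ShiftResult {
    std::vector<int> tokens;
    int stop_generation_position;
};

static ShiftResult shift_context(const std::vector<int> &tokens,
                                 int system_prompt_position,
                                 int stop_generation_position,
                                 int attention_sink) {
    const int current_position = (int) tokens.size();
    // Never evict the attention-sink tokens or the system prompt.
    const int keep_start = std::max(attention_sink, system_prompt_position);
    // Discard the older half of everything appended after that point.
    const int n_discard = (current_position - keep_start) / 2;

    std::vector<int> shifted(tokens.begin(), tokens.begin() + keep_start);
    shifted.insert(shifted.end(),
                   tokens.begin() + keep_start + n_discard, tokens.end());

    // The core fix: without this subtraction the stop position is never
    // reached after a shift, and generation runs forever.
    return {shifted, stop_generation_position - n_discard};
}
```

With 10 tokens, a system prompt ending at position 2, and a 4-token sink, this keeps tokens 0-3, discards 3 of the remaining 6, and moves the stop position from 10 to 7.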
```
 * Context shifting by discarding the older half of the tokens appended after system prompt:
 * - take the [system_prompt_position] first tokens from the original prompt
 * - take half of the last (system_prompt_position - system_prompt_position) tokens
```
The comment contains a documentation error: "system_prompt_position - system_prompt_position" appears to reference the same variable twice. Based on the code logic below, this should likely read "current_position - system_prompt_position" to indicate taking half of the tokens between the system prompt position and the current position.
```diff
- * - take half of the last (system_prompt_position - system_prompt_position) tokens
+ * - take half of the last (current_position - system_prompt_position) tokens
```
```
 * - recompute the logits in batches
 */
static void shift_context() {
    const int attention_sink = 4;
```
The magic number 4 for attention_sink lacks explanation. Consider adding a comment explaining why 4 tokens are chosen as the attention sink, or defining this as a named constant (e.g., ATTENTION_SINK_SIZE) with documentation about its purpose in maintaining model stability during context shifting.
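The reviewer's suggestion could look like the following. This is a sketch only; the constant name and comment wording are assumptions, not code from the PR:

```cpp
#include <cassert>

// Illustrative named constant for the attention sink size (the name and
// comment wording are assumptions, not the PR's actual code).
//
// The first few tokens of a sequence act as "attention sinks": autoregressive
// models tend to park surplus attention mass on them, so evicting them from
// the KV cache during context shifting degrades perplexity. Pinning a small
// fixed number of initial tokens avoids this; 4 is the value this PR uses.
static const int ATTENTION_SINK_SIZE = 4;
```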
@naco-siren I just submitted a PR; could you check whether it meets the project's conventions? Also, @ssam18's changes still have issues, as described in detail above.
Updated comments in shift_context function to clarify token handling.
Summary
This PR addresses a logical error in the Android `ai_chat.cpp` example where context shifting caused infinite text generation. It also enhances the context management by implementing an attention sink mechanism to maintain model stability during long-stream generation.

Changes

- Fixed `shift_context()` to properly subtract `n_discard` from `stop_generation_position`.
- Introduced `attention_sink` to preserve the initial tokens in the KV cache.
- Moved the `shift_context()` function definition to appear after the declaration of `stop_generation_position` and other short-term state variables.

Attribution

`stop_generation_position` is defined after the function `shift_context()`.

Fixes #18409
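The failure mode this PR fixes can be reproduced with a toy loop. All names and numbers below are hypothetical, not the example's real code: if a context shift moves the current position back without moving the stop position back by the same amount, the position can never catch up and the loop never terminates.

```cpp
#include <cassert>

// Toy model of the bug: each step emits one token. When the window fills,
// a shift discards n_discard tokens, moving the current position back.
// Only when fix_stop_position is true is stop_generation_position moved
// back by the same amount; otherwise the loop can never reach it.
static int steps_until_stop(bool fix_stop_position) {
    const int context_size = 8;
    const int n_discard    = 4;
    int position = 0;
    int stop_generation_position = 12;

    int steps = 0;
    while (position < stop_generation_position) {
        if (++steps > 1000) return -1;   // treat as "infinite generation"
        ++position;                      // one generated token
        if (position >= context_size) {  // context full: shift
            position -= n_discard;
            if (fix_stop_position)
                stop_generation_position -= n_discard;
        }
    }
    return steps;
}
```

With the fix the loop terminates after 12 steps; without it, the position oscillates between 4 and 8 while the stop position stays at 12, so the loop never exits.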