android: Fix infinite generation by updating stop_generation_position and implement attention sinks#18476
Conversation
…_context(): Properly maintain stop_generation_position when the context is shifted. Additionally, add an attention_sink variable to maintain attention on the initial tokens. Move the shift_context() function definition to after the variable stop_generation_position is defined. Fixes ggml-org#18409.
Pull request overview
This PR fixes a critical bug in the Android example's context shifting implementation that caused infinite text generation, and enhances the mechanism with attention sink support to maintain model stability during long conversations.
Key Changes
- Fixed infinite generation loop by updating `stop_generation_position` when the context is shifted
- Implemented attention sink mechanism to preserve initial tokens and prevent perplexity degradation
- Relocated the `shift_context()` function definition after the short-term state variables for proper scope access
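Taken together, the two fixes can be sketched as a pure function over the token history. This is a minimal illustration only; the names, signature, and exact arithmetic are assumptions, not the PR's actual `ai_chat.cpp` code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch of context shifting with an attention sink:
// keep the first `attention_sink` tokens (and the system prompt),
// discard the older half of the tokens appended after that point,
// and shift stop_generation_position left by the discarded count so
// the generation loop's termination check stays consistent.
struct ShiftResult {
    std::vector<int> tokens;
    int stop_generation_position;
};

static ShiftResult shift_context(const std::vector<int> &tokens,
                                 int system_prompt_position,
                                 int stop_generation_position,
                                 int attention_sink) {
    const int current_position = (int) tokens.size();
    // Never evict the attention-sink tokens or the system prompt.
    const int keep_start = std::max(attention_sink, system_prompt_position);
    // Discard the older half of everything appended after that point.
    const int n_discard = (current_position - keep_start) / 2;

    std::vector<int> shifted(tokens.begin(), tokens.begin() + keep_start);
    shifted.insert(shifted.end(),
                   tokens.begin() + keep_start + n_discard, tokens.end());

    // The core fix: without this subtraction the stop position is never
    // reached after a shift, and generation runs forever.
    return {shifted, stop_generation_position - n_discard};
}
```

With 10 tokens, a system prompt ending at position 2, and a 4-token sink, this keeps tokens 0-3, discards 3 of the remaining 6, and moves the stop position from 10 to 7.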
```
 * Context shifting by discarding the older half of the tokens appended after system prompt:
 * - take the [system_prompt_position] first tokens from the original prompt
 * - take half of the last (system_prompt_position - system_prompt_position) tokens
```
The comment contains a documentation error: "system_prompt_position - system_prompt_position" appears to reference the same variable twice. Based on the code logic below, this should likely read "current_position - system_prompt_position" to indicate taking half of the tokens between the system prompt position and the current position.
```diff
- * - take half of the last (system_prompt_position - system_prompt_position) tokens
+ * - take half of the last (current_position - system_prompt_position) tokens
```
```
 * - recompute the logits in batches
 */
static void shift_context() {
    const int attention_sink = 4;
```
The magic number 4 for attention_sink lacks explanation. Consider adding a comment explaining why 4 tokens are chosen as the attention sink, or defining this as a named constant (e.g., ATTENTION_SINK_SIZE) with documentation about its purpose in maintaining model stability during context shifting.
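The reviewer's suggestion could look like the following. This is a sketch only; the constant name and comment wording are assumptions, not code from the PR:

```cpp
#include <cassert>

// Illustrative named constant for the attention sink size (the name and
// comment wording are assumptions, not the PR's actual code).
//
// The first few tokens of a sequence act as "attention sinks": autoregressive
// models tend to park surplus attention mass on them, so evicting them from
// the KV cache during context shifting degrades perplexity. Pinning a small
// fixed number of initial tokens avoids this; 4 is the value this PR uses.
static const int ATTENTION_SINK_SIZE = 4;
```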
@naco-siren I just submitted a PR; could you check whether it meets the project's conventions? Also, @ssam18's changes still have issues, as described in detail above.
Updated comments in shift_context function to clarify token handling.
Summary
This PR addresses a logical error in the Android `ai_chat.cpp` example where context shifting caused infinite text generation. It also enhances the context management by implementing an attention sink mechanism to maintain model stability during long-stream generation.

Changes

- Fixed `shift_context()` to properly subtract `n_discard` from `stop_generation_position`.
- Introduced `attention_sink` to preserve the initial tokens in the KV cache.
- Moved the `shift_context()` function definition to appear after the declaration of `stop_generation_position` and other short-term state variables.

Attribution

`stop_generation_position` is defined after the function `shift_context()`.

Fixes #18409
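The failure mode this PR fixes can be reproduced with a toy loop. All names and numbers below are hypothetical, not the example's real code: if a context shift moves the current position back without moving the stop position back by the same amount, the position can never catch up and the loop never terminates.

```cpp
#include <cassert>

// Toy model of the bug: each step emits one token. When the window fills,
// a shift discards n_discard tokens, moving the current position back.
// Only when fix_stop_position is true is stop_generation_position moved
// back by the same amount; otherwise the loop can never reach it.
static int steps_until_stop(bool fix_stop_position) {
    const int context_size = 8;
    const int n_discard    = 4;
    int position = 0;
    int stop_generation_position = 12;

    int steps = 0;
    while (position < stop_generation_position) {
        if (++steps > 1000) return -1;   // treat as "infinite generation"
        ++position;                      // one generated token
        if (position >= context_size) {  // context full: shift
            position -= n_discard;
            if (fix_stop_position)
                stop_generation_position -= n_discard;
        }
    }
    return steps;
}
```

With the fix the loop terminates after 12 steps; without it, the position oscillates between 4 and 8 while the stop position stays at 12, so the loop never exits.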