
Conversation

ayourtch commented Nov 9, 2025

Make sure to read the contributing guidelines before submitting a PR

claude added 24 commits November 8, 2025 11:11
This commit adds two new CLI options to the main tool:
- --dump-activations: Dumps intermediate layer activations to a GGUF file
- --load-activations: Loads and displays activations from a GGUF file

The implementation:
- Adds activation collection callback that captures tensors from various
  operations (MUL_MAT, ADD, MUL, NORM, RMS_NORM)
- Saves collected activations with metadata (version, count) to GGUF format
- Loads and displays activation tensor information from GGUF files

This is useful for debugging, analysis, and understanding model behavior
by examining intermediate activations during inference.
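
As a rough illustration of the collection callback this commit describes (the activation_store and collect_activations_cb names below are invented, not taken from the patch), a cb_eval-style callback might look roughly like this:

  #include <map>
  #include <string>
  #include <vector>

  #include "ggml.h"
  #include "ggml-backend.h"

  struct activation_store {
      std::map<std::string, std::vector<float>> data; // tensor name -> host copy
  };

  static bool collect_activations_cb(struct ggml_tensor * t, bool ask, void * user_data) {
      auto * store = static_cast<activation_store *>(user_data);

      const bool wanted =
          t->op == GGML_OP_MUL_MAT || t->op == GGML_OP_ADD ||
          t->op == GGML_OP_MUL     || t->op == GGML_OP_NORM ||
          t->op == GGML_OP_RMS_NORM;

      if (ask) {
          return wanted; // first pass: tell the scheduler which tensors to observe
      }
      if (!wanted || t->type != GGML_TYPE_F32) {
          return true;   // not interesting - keep evaluating the graph
      }

      // second pass: data is computed (possibly on a GPU backend) - copy it to host memory
      std::vector<float> host(ggml_nelements(t));
      ggml_backend_tensor_get(t, host.data(), 0, ggml_nbytes(t));
      store->data[ggml_get_name(t)] = std::move(host);

      return true;
  }
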
…sation

This commit adds two interactive commands that can be used during a chat session:
- /\/save <filename> - Triggers activation collection for the next inference pass
  and saves the collected activations to a GGUF file
- /\/load <filename> - Loads and displays activations from a GGUF file

Key features:
- Activations are collected only for one inference pass when /\/save is used
- The callback is automatically enabled in interactive mode
- Clear user feedback when activations are being collected and saved
- Helpful command information displayed when entering interactive mode

This allows users to interactively capture activations at specific points in a
conversation, making it much more flexible for debugging and analysis.

Example usage:
  > /\/save response1.gguf
  > What is the capital of France?
  [activations collected and saved after response]

  > /\/load response1.gguf
  [displays tensor information]
…d debugging

The previous implementation had a bug where after entering the /\/save or /\/load
command, the code would continue processing instead of looping back to wait for
the next user input.

Changes:
- Added explicit 'continue' after processing commands to loop back properly
- Set is_interacting=true to ensure we stay in interactive mode
- Improved user feedback messages (clearer instructions)
- Added debug logging to help diagnose activation collection issues:
  * Log when callback collects activations
  * Log activation count before saving
  * Better error messages if no activations collected

This fixes the issue where /\/save would trigger inference immediately
instead of waiting for the next user prompt.
The previous code was setting params.cb_eval AFTER the context was created
by common_init_from_params(params). This meant the context never received
the callback, so no activations were being collected.

This commit moves the callback setup to happen BEFORE context creation,
which is essential for the callback to be properly registered with the
llama context.

Changes:
- Moved callback setup before common_init_from_params() call
- Added debug logging to confirm callback is enabled in interactive mode
- Added comment explaining why the order matters

This fixes the issue where 0 activations were being collected because
the callback was never being triggered during inference.
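
In code form, the ordering fix amounts to something like the following sketch (reusing the hypothetical callback and store from the earlier snippet):

  activation_store store;

  // set the callback BEFORE the context is created, otherwise it is never registered
  params.cb_eval           = collect_activations_cb;
  params.cb_eval_user_data = &store;

  // model + context are created here and pick up cb_eval from params
  auto llama_init = common_init_from_params(params);
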
This commit adds debugging to help diagnose why the callback isn't
being triggered:

1. Set params.cb_eval_user_data explicitly to nullptr
2. Disable warmup (params.warmup = false) as warmup may interfere
   with callbacks - following the pattern from the imatrix tool
3. Add debug logging to show when callback is first invoked
4. Add debug logging when callback asks about tensors

This will help us understand if:
- The callback is being invoked at all
- The callback is seeing tensors but filtering them out
- There's an issue with the callback registration

To test with debug output, run:
  LLAMA_LOG_LEVEL=5 ./llama-cli -m model.gguf --interactive
This commit improves visibility into the activation collection process,
especially when using GPU offloading:

Changes:
1. Changed first callback log from LOG_DBG to LOG - now always visible
   to immediately show if callback is being invoked
2. Added collection counter to track how many activations are collected
3. Changed duplicate handling - instead of skipping, create unique names
   (tensor_1, tensor_2, etc.) to capture all activations
4. Added detailed logging showing if tensors are on CPU vs GPU
5. Added periodic progress logging every 10 collections
6. Improved debug output to show operation type

The code already handles GPU tensors correctly using ggml_backend_tensor_get()
to copy from GPU memory to host, following the same pattern as the imatrix tool.

This will help diagnose why activations might not be collected when using
GPU offloading.
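
A condensed sketch of the host-vs-device handling described here (the fetch_activation helper name is made up; the pattern follows the imatrix tool):

  #include <cstring>
  #include <vector>

  #include "ggml.h"
  #include "ggml-backend.h"
  #include "log.h"

  // copy one F32 activation tensor into host memory, logging where it came from
  static std::vector<float> fetch_activation(const struct ggml_tensor * t) {
      const bool is_host = t->buffer == nullptr || ggml_backend_buffer_is_host(t->buffer);
      LOG("collecting %s (op=%s, %s)\n",
          ggml_get_name(t), ggml_op_name(t->op), is_host ? "CPU" : "GPU");

      std::vector<float> host_copy(ggml_nelements(t));
      if (is_host) {
          memcpy(host_copy.data(), t->data, ggml_nbytes(t));               // already on host
      } else {
          ggml_backend_tensor_get(t, host_copy.data(), 0, ggml_nbytes(t)); // device -> host copy
      }
      return host_copy;
  }
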
ROOT CAUSE FOUND:
The callback was never being invoked because llama.cpp reuses the computation
graph on subsequent inferences for performance. When the graph is reused,
the callback is NOT set (see llama-context.cpp:778 - callback only set when
graph is rebuilt).

SOLUTION:
Set the environment variable LLAMA_GRAPH_REUSE_DISABLE=1 when activation
collection is enabled. This forces the graph to be rebuilt on every inference,
which ensures the callback gets set and invoked.

Changes:
1. Call setenv("LLAMA_GRAPH_REUSE_DISABLE", "1", 1) in both:
   - When --dump-activations is used
   - When in interactive mode (for /\/save command)

2. Added log message to inform users graph reuse is disabled

Note: Disabling graph reuse will slightly reduce performance, but this is
necessary for the callback mechanism to work. This only affects runs where
activation dumping is enabled.

This should finally fix the issue where 0 activations were being collected!
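
The workaround itself is small; a sketch is shown below (the dump_activations flag is hypothetical, the rest follows the commit text; setenv() is POSIX, so a Windows build would need _putenv_s()):

  #include <cstdlib>

  if (params.dump_activations || params.interactive) {
      // force graph rebuilds so cb_eval is re-attached on every decode
      setenv("LLAMA_GRAPH_REUSE_DISABLE", "1", 1);
      LOG_INF("activation capture: graph reuse disabled (slower, but required for cb_eval)\n");
  }
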
Replace callback-based approach with simpler state serialization:
- Remove all callback and graph reuse disabling code
- Use llama_state_get_data/set_data for complete state capture
- Save/restore: KV cache, RNG state, logits, embeddings
- No performance penalty from disabled optimizations

Features:
- CLI flags: --dump-activations, --load-activations
- Interactive: /\/save <file>, /\/load <file>
- GGUF format with metadata (version, size, type)
- Exact state restoration for conversation continuity

This makes it possible to save the full LLM state and later restore it to
exactly the same conversational state, enabling conversation checkpointing
and experimentation with different conversation branches.
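
A minimal sketch of the snapshot/restore calls involved (the GGUF wrapping and metadata are omitted, and the helper names are invented):

  #include <cstdint>
  #include <vector>

  #include "llama.h"

  static std::vector<uint8_t> snapshot_state(llama_context * ctx) {
      // KV cache + RNG state + logits + embeddings in one opaque blob
      std::vector<uint8_t> buf(llama_state_get_size(ctx));
      buf.resize(llama_state_get_data(ctx, buf.data(), buf.size()));
      return buf;
  }

  static bool restore_state(llama_context * ctx, const std::vector<uint8_t> & buf) {
      // returns the number of bytes consumed; 0 indicates failure/mismatch
      return llama_state_set_data(ctx, buf.data(), buf.size()) > 0;
  }
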
Implements interactive temperature get/set functionality:
- /\/temp - displays current temperature setting
- /\/temp <value> - changes temperature on the fly

Technical implementation:
- Add common_sampler_get_temp() and common_sampler_set_temp() to sampling API
- Set temp works by removing old temperature sampler from chain and replacing it
- Preserves dynamic temperature settings (dynatemp_range, exponent) when set
- Validates temperature values (must be >= 0.0)

This allows users to experiment with different temperature values
during a conversation without restarting the program, enabling
exploration of how temperature affects model outputs in real-time.
The previous swapping logic for repositioning the sampler in the chain
was buggy and did not actually change the temperature.

Fixed approach:
1. Remove old temperature sampler at position temp_idx
2. Collect all samplers that come after that position
3. Add new temperature sampler with new value
4. Add back all collected samplers in original order

This correctly preserves the sampler chain order while replacing
the temperature sampler, so temperature changes now actually take effect.

Users should now see dramatic differences in output randomness:
- temp 0.1: very deterministic
- temp 1.0: balanced
- temp 10.0+: increasingly random/creative
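
A sketch of that remove-and-rebuild approach on a raw sampler chain (the PR wraps this behind common_sampler_set_temp(); the free-standing helper below and its detection of the temperature sampler by name are assumptions):

  #include <string>
  #include <vector>

  #include "llama.h"

  static void set_chain_temp(llama_sampler * chain, float new_temp) {
      // 1. find the temperature sampler in the chain
      int temp_idx = -1;
      for (int i = 0; i < llama_sampler_chain_n(chain); ++i) {
          const std::string name = llama_sampler_name(llama_sampler_chain_get(chain, i));
          if (name.find("temp") != std::string::npos) { temp_idx = i; break; }
      }
      if (temp_idx < 0) return;

      // 2. pop everything after it so the chain order can be preserved
      std::vector<llama_sampler *> tail;
      while (llama_sampler_chain_n(chain) > temp_idx + 1) {
          tail.push_back(llama_sampler_chain_remove(chain, temp_idx + 1));
      }

      // 3. replace the temperature sampler itself
      llama_sampler_free(llama_sampler_chain_remove(chain, temp_idx));
      llama_sampler_chain_add(chain, llama_sampler_init_temp(new_temp));

      // 4. re-append the tail in its original order
      for (llama_sampler * s : tail) {
          llama_sampler_chain_add(chain, s);
      }
  }
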
User reports temperature changes aren't taking effect.
Added detailed logging to trace:
- Sampler chain composition
- Temperature sampler detection
- Removal and replacement process
- Final chain state

This will help identify why temperature modifications
aren't affecting model output behavior.
Implements automatic KV cache persistence for debugging workflows:

New CLI flags:
- --kv-cache-auto-save <base-name>
  Automatically saves all slot KV caches to timestamped directory
  on server shutdown: <base-name>_YYYYMMDD_HHMMSS/

- --kv-cache-auto-load <dirname>
  Automatically loads all slot KV caches from specified directory
  on server startup

Implementation:
- auto_save_kv_cache(): Saves each non-empty slot to slot_N.bin
  in timestamped directory on server shutdown

- auto_load_kv_cache(): Loads each slot from slot_N.bin files
  on server initialization

- Uses llama_state_seq_save_file/load_file for per-slot persistence
- Integrated into server lifecycle: load after init(), save in cleanup

Usage example:
  # First run - build up KV cache state
  llama-server -m model.gguf --kv-cache-auto-save my_cache

  # Server shutdown creates: my_cache_20250108_143022/

  # Second run - restore state instantly
  llama-server -m model.gguf --kv-cache-auto-load my_cache_20250108_143022

This enables fast server restarts during debugging by preserving
the complete conversation/context state across sessions.
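
A simplified sketch of the per-slot save/load calls (slot bookkeeping is reduced to "each slot is one sequence id plus its cached tokens", and the helper names are invented):

  #include <string>
  #include <vector>

  #include "llama.h"

  static void save_slot(llama_context * ctx, const std::string & dir, int slot_id,
                        const std::vector<llama_token> & cache_tokens) {
      const std::string path = dir + "/slot_" + std::to_string(slot_id) + ".bin";
      llama_state_seq_save_file(ctx, path.c_str(), slot_id,
                                cache_tokens.data(), cache_tokens.size());
  }

  static size_t load_slot(llama_context * ctx, const std::string & dir, int slot_id,
                          std::vector<llama_token> & cache_tokens_out) {
      const std::string path = dir + "/slot_" + std::to_string(slot_id) + ".bin";
      cache_tokens_out.resize(llama_n_ctx(ctx));          // worst-case capacity
      size_t n_tokens = 0;
      const size_t n_read = llama_state_seq_load_file(ctx, path.c_str(), slot_id,
                                                      cache_tokens_out.data(),
                                                      cache_tokens_out.size(), &n_tokens);
      cache_tokens_out.resize(n_read ? n_tokens : 0);
      return n_read;                                       // 0 on failure
  }
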
New endpoint: POST /save-kv-cache

Allows saving KV cache at any point during server operation,
not just on shutdown. Useful for creating checkpoints during
interactive debugging sessions.

Request body (optional):
{
  "dirname": "my_checkpoint"  // Custom directory name
}

Response:
{
  "success": true,
  "directory": "my_checkpoint",  // or timestamped if not specified
  "message": "KV cache saved successfully"
}

If dirname is not provided, a timestamped directory name is generated
automatically from the --kv-cache-auto-save base name.

Implementation:
- Refactored auto_save_kv_cache() into save_kv_cache_to_dir(dirname)
- save_kv_cache_to_dir() accepts optional custom directory name
- Returns directory name on success, empty string on failure
- New endpoint handler parses the JSON body and calls the save function
- Registered at: POST /save-kv-cache

Usage examples:
  # Save with custom name
  curl -X POST http://localhost:8080/save-kv-cache \
    -H "Content-Type: application/json" \
    -d '{"dirname": "checkpoint_before_fix"}'

  # Save with auto-generated timestamp
  curl -X POST http://localhost:8080/save-kv-cache \
    -H "Content-Type: application/json" \
    -d '{}'
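
A sketch of how the handler could be wired up with cpp-httplib and nlohmann::json, which the server already uses (svr is assumed to be the server's httplib::Server instance, and save_kv_cache_to_dir() is the refactored helper described above, returning "" on failure):

  svr->Post("/save-kv-cache", [&](const httplib::Request & req, httplib::Response & res) {
      std::string dirname;
      if (!req.body.empty()) {
          const json body = json::parse(req.body, nullptr, /*allow_exceptions=*/false);
          if (body.is_object() && body.contains("dirname")) {
              dirname = body.at("dirname").get<std::string>();
          }
      }

      const std::string saved_dir = save_kv_cache_to_dir(dirname); // "" -> timestamped name handled inside
      if (saved_dir.empty()) {
          res.status = 500;
          res.set_content(json{{"success", false}, {"message", "failed to save KV cache"}}.dump(),
                          "application/json");
          return;
      }
      res.set_content(json{{"success", true},
                           {"directory", saved_dir},
                           {"message", "KV cache saved successfully"}}.dump(),
                      "application/json");
  });
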
Implements a production-ready activation capture system that allows real-time
streaming of intermediate layer activations to disk for analysis:

- Queue-based async I/O with background writer thread to avoid blocking inference
- GPU tensor support via automatic ggml_backend_tensor_get() transfers
- Flexible filtering by regex patterns and layer ranges
- Binary file format (LLMACT01) with timestamped metadata entries
- Size limits to prevent unbounded disk usage
- HTTP endpoints:
  * POST /activations/start - Begin capture with filters and limits
  * POST /activations/stop - Stop capture and finalize output
  * GET /activations/status - Query current capture statistics

Implementation details:
- Callback set via params.cb_eval at model initialization
- Global pointer g_activation_capture enables thread-safe dynamic control
- Producer-consumer pattern with condition variables for queue management
- Atomic counters for bytes_written and entries_captured statistics

This enables debugging and analysis workflows like:
- Comparing activations between model versions
- Identifying problematic layers causing inference issues
- Analyzing attention patterns and intermediate representations
- Debugging quantization effects on specific layers
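
A condensed sketch of the producer-consumer design described above (struct and member names are illustrative; the real implementation also writes the LLMACT01 header, metadata entries, and size limits):

  #include <atomic>
  #include <condition_variable>
  #include <cstdint>
  #include <cstdio>
  #include <mutex>
  #include <queue>
  #include <string>
  #include <thread>
  #include <vector>

  struct activation_capture {
      std::queue<std::vector<uint8_t>> queue;   // serialized entries waiting for disk
      std::mutex                       mtx;
      std::condition_variable          cv;
      std::atomic<bool>                running{false};
      std::atomic<size_t>              bytes_written{0};
      std::atomic<size_t>              entries_captured{0};
      std::thread                      writer;

      void start(const std::string & path) {
          running = true;
          writer = std::thread([this, path]() {
              FILE * f = fopen(path.c_str(), "wb");
              if (!f) { running = false; return; }
              std::unique_lock<std::mutex> lock(mtx);
              while (running || !queue.empty()) {
                  cv.wait(lock, [this] { return !queue.empty() || !running; });
                  while (!queue.empty()) {
                      auto entry = std::move(queue.front());
                      queue.pop();
                      lock.unlock();                       // write without holding the lock
                      fwrite(entry.data(), 1, entry.size(), f);
                      bytes_written += entry.size();
                      lock.lock();
                  }
              }
              fclose(f);
          });
      }

      // called from the cb_eval callback (producer side)
      void push(std::vector<uint8_t> entry) {
          {
              std::lock_guard<std::mutex> lock(mtx);
              queue.push(std::move(entry));
          }
          entries_captured++;
          cv.notify_one();
      }

      void stop() {
          running = false;
          cv.notify_one();
          if (writer.joinable()) writer.join();
      }
  };
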
Implements a flexible tool-calling system that allows LLMs to invoke external
executables from a "tools" directory:

Features:
- LLM can request tool list by outputting <tools-help/>
  * Automatically scans "tools" directory for executables
  * Runs each with "help" parameter to collect usage info
  * Injects concatenated help text back into conversation

- LLM can execute tools by outputting <tool-launch>tool-name args</tool-launch>
  * Executes the specified tool from tools/ directory
  * Captures stdout/stderr and exit code
  * Injects output back into conversation for LLM to process

Implementation:
- Platform-specific: Full support on Unix/macOS, stub on Windows
- Uses popen() for command execution with output capture
- Alphabetically sorted tool listing for consistency
- Robust parsing of tool-launch tags with argument extraction
- Checks recent output buffer (128 tokens) for tag detection

Example tools directory structure:
  tools/
    calculator     (executable)
    web_search     (executable)
    file_reader    (executable)

This enables LLMs to:
- Access external data sources
- Perform calculations
- Query databases or APIs
- Interact with system utilities
- Extend capabilities without retraining

Security note: Only executables in the "tools" directory are accessible,
providing a sandboxed environment for tool execution.
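
A sketch of the Unix-side tool execution described above (argument handling and the Windows stub are omitted; the run_tool helper name is made up):

  #include <array>
  #include <cstdio>
  #include <string>

  static std::string run_tool(const std::string & tool_name, const std::string & args, int & exit_code) {
      // only names under the fixed "tools" directory are ever run
      const std::string cmd = "./tools/" + tool_name + " " + args + " 2>&1"; // capture stderr too

      FILE * pipe = popen(cmd.c_str(), "r");
      if (!pipe) {
          exit_code = -1;
          return "failed to launch tool: " + tool_name;
      }

      std::string output;
      std::array<char, 4096> buf;
      while (fgets(buf.data(), buf.size(), pipe) != nullptr) {
          output += buf.data();
      }
      exit_code = pclose(pipe);   // raw wait status; WEXITSTATUS() would decode it
      return output;
  }
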
Changed tool detection logic to use mutually exclusive checks (if/else)
instead of independent checks. This prevents tool help text containing
<tool-launch> examples from being accidentally executed.

Previously:
- Both <tools-help/> and <tool-launch> were checked in the same iteration
- If help text contained example usage like "<tool-launch>calc 2+2</tool-launch>",
  it would be detected and executed immediately after being injected

Now:
- Only one tool action is processed per iteration
- If <tools-help/> is detected, skip <tool-launch> check until next iteration
- Help examples remain as documentation without triggering execution
Fixes two critical issues with tool calling:

1. Think tag filtering:
   - Tool tags inside <think>...</think> are now ignored
   - Added is_inside_think_tags() to check if a position is within think blocks
   - Prevents accidental tool execution during model reasoning
   - Recursively searches for tool-launch tags outside think blocks

2. Duplicate execution prevention:
   - Tracks last executed tool signature (tool_name|args)
   - Skips re-execution if same tool call detected in buffer
   - Resets signature on new user input to allow reuse in conversation
   - Prevents multiple executions when tag remains in 128-token lookback window

Example scenarios now handled correctly:

Scenario 1 - Think tags:
  Model: <think>Maybe I should use <tool-launch>calc 2+2</tool-launch></think>
  Result: Tool NOT executed (inside think block)

  Model: Let me calculate this. <tool-launch>calc 2+2</tool-launch>
  Result: Tool executed (outside think block)

Scenario 2 - Duplicates:
  Model generates: <tool-launch>search foo</tool-launch>
  Iteration 1: Tool executed, output injected
  Iteration 2: Same tag still in buffer -> skipped
  User types new input
  Model generates: <tool-launch>search foo</tool-launch>
  Result: Tool executed again (signature reset on user input)

This ensures tools are only executed when the model explicitly intends to
use them outside of reasoning blocks, and each tool call executes exactly once.
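
A sketch of the two guards described above, operating on the recent output buffer (helper names other than is_inside_think_tags() are invented):

  #include <string>

  // a tag starting at `pos` is ignored if the last <think> before it has no
  // matching </think> in between
  static bool is_inside_think_tags(const std::string & buf, size_t pos) {
      const size_t open  = buf.rfind("<think>",  pos);
      const size_t close = buf.rfind("</think>", pos);
      return open != std::string::npos && (close == std::string::npos || close < open);
  }

  // duplicate-execution guard: remember the last executed call and skip repeats
  // until the signature is reset on new user input (or an idle auto-submit)
  static std::string last_tool_signature;

  static bool should_execute(const std::string & tool_name, const std::string & args) {
      const std::string sig = tool_name + "|" + args;
      if (sig == last_tool_signature) {
          return false;            // same call still sitting in the lookback window
      }
      last_tool_signature = sig;
      return true;
  }
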
Implements a feature to auto-submit empty input after a period of user
inactivity, allowing the agent to continue thinking without user interaction.

New CLI flag:
  --idle-action-interval N
    Auto-submit empty input after N minutes of idle time (default: 0 = disabled)

How it works:
- Tracks last activity timestamp (updated on any user input)
- Before waiting for user input, checks if idle interval has elapsed
- If idle timeout reached, automatically submits empty input
- Resets timer after auto-submission for next iteration
- Any keystroke/input from user resets the idle timer

Use cases:
- Agent continues reasoning/thinking during long idle periods
- Useful for autonomous workflows where agent should self-prompt
- Allows agent to work through complex problems without waiting

Example usage:
  llama-cli -m model.gguf --idle-action-interval 5
  (Agent will auto-submit empty input after 5 minutes of no user activity)

Implementation notes:
- Activity time tracked globally via g_last_activity_time
- Idle check happens when interactive mode waits for input
- Auto-submitted input is distinguishable from Ctrl+D (EOF)
- Console readline is bypassed when idle timeout triggers
- Timer resets on both manual and automatic input submission

Changes:
- common/common.h: Added idle_action_interval parameter
- common/arg.cpp: Added --idle-action-interval argument parser
- tools/main/main.cpp: Implemented idle timeout logic and tracking
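
A sketch of the timer bookkeeping described above (g_last_activity_time and the interval in minutes follow the commit text; the idle_timeout_reached helper is a made-up name):

  #include <ctime>

  static time_t g_last_activity_time = time(nullptr);

  static void update_activity_time() {
      g_last_activity_time = time(nullptr);
  }

  static bool idle_timeout_reached(int idle_action_interval_min) {
      if (idle_action_interval_min <= 0) {
          return false;                                   // feature disabled
      }
      const double idle_s = difftime(time(nullptr), g_last_activity_time);
      return idle_s >= idle_action_interval_min * 60.0;
  }
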
Previously, the idle timer would continue running while the agent was
generating output, which could cause immediate auto-submission if the
agent took longer than the idle interval to respond.

Now the timer resets when transitioning to interactive mode (when
is_interacting becomes true), ensuring it only measures time spent
waiting for user input, not time spent generating.

Behavior before fix:
1. User enters input → timer updates (t=0)
2. Agent generates for 10 minutes → timer running (t=10m)
3. Agent finishes → idle check
4. If idle_interval=5m → triggers immediately (10m > 5m)

Behavior after fix:
1. User enters input → timer updates (t=0)
2. Agent generates for 10 minutes → timer running but will be reset
3. Agent finishes, enters interactive mode → timer resets (t=0)
4. User idle for 5 minutes → idle timeout triggers correctly (5m >= 5m)

The fix adds an update_activity_time() call at line 1277, when entering
the "waiting for user input" state, right before displaying the prompt.
Previous implementation checked for timeout but then blocked indefinitely
on readline. Now uses select() (Unix) / WaitForSingleObject (Windows) to
check if input is available before blocking.

Changes:
- common/console.h: Added readline_with_timeout() function
- common/console.cpp: Implemented timeout using select()/WaitForSingleObject
- tools/main/main.cpp: Use readline_with_timeout with calculated timeout

How it works:
1. Calculate remaining timeout based on idle_action_interval and elapsed time
2. Call readline_with_timeout() with remaining seconds
3. If timeout occurs, auto-submit empty input
4. If user types anything, reset timer and disable timeout for continuation lines

Unix implementation:
- Uses select() on STDIN_FILENO with timeout
- Returns immediately if input available or timeout elapsed

Windows implementation:
- Uses WaitForSingleObject() on stdin handle with timeout
- Returns immediately if input available or timeout elapsed

This fixes the issue where idle timeout would never trigger because
readline() was blocking indefinitely waiting for input.
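
A Unix-side sketch of the approach described above: wait on stdin with select() for up to the remaining timeout, then do a normal blocking read only if input is actually available (the WaitForSingleObject variant and the console state handling are omitted; this is not the exact console.cpp code):

  #include <iostream>
  #include <string>

  #include <sys/select.h>
  #include <unistd.h>

  // returns false when the timeout elapsed with no input (caller auto-submits "")
  static bool readline_with_timeout(std::string & line, int timeout_s) {
      fd_set fds;
      FD_ZERO(&fds);
      FD_SET(STDIN_FILENO, &fds);

      timeval tv;
      tv.tv_sec  = timeout_s;
      tv.tv_usec = 0;

      const int ready = select(STDIN_FILENO + 1, &fds, nullptr, nullptr,
                               timeout_s > 0 ? &tv : nullptr);  // nullptr = block forever
      if (ready <= 0) {
          line.clear();
          return false;   // timed out (or select error) - treat as empty input
      }

      // input is available - this read will not block
      return static_cast<bool>(std::getline(std::cin, line));
  }
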
… thinking

Previously, the tool deduplication signature was only reset on explicit user
input, preventing the LLM from reusing tools during idle-triggered thinking sessions.

Now the signature is also reset when idle timeout triggers auto-submission,
treating it as a new conversational turn where tools can be used again.

Behavior before:
1. LLM uses <tool-launch>calculator 5+5</tool-launch> → executes
2. User idles, timeout triggers → empty input submitted
3. LLM tries <tool-launch>calculator 10+10</tool-launch> → blocked (duplicate)

Behavior after:
1. LLM uses <tool-launch>calculator 5+5</tool-launch> → executes
2. User idles, timeout triggers → signature reset, empty input submitted
3. LLM tries <tool-launch>calculator 10+10</tool-launch> → executes (new turn)

This allows the agent to fully utilize tools during autonomous thinking sessions
triggered by the --idle-action-interval feature.
Adds an interactive command to view and change the idle action interval
without restarting llama-cli.

Usage:
  /\/timeout           - Show current timeout and disable if enabled
  /\/timeout <minutes> - Set idle timeout to N minutes
  /\/timeout 0         - Disable idle timeout

Behavior:
- /\/timeout with no args displays current setting and disables if enabled
- /\/timeout N sets the timeout to N minutes (0 = disabled)
- When setting a new non-zero timeout, timer resets immediately
- Complements --idle-action-interval CLI flag with runtime control

Example session:
  > /\/timeout
  Idle timeout is currently disabled (0 minutes)

  > /\/timeout 5
  Changing idle timeout from 0 to 5 minutes
  Idle timeout set to 5 minutes

  > /\/timeout
  Current idle timeout: 5 minutes
  Disabling idle timeout

This allows users to dynamically enable/disable autonomous agent thinking
during a session based on their workflow needs.
When the idle timeout triggered with no user input, the buffer was empty, but
the code attempted to access buffer.back(), which is undefined behavior.

The issue occurred because:
1. Timeout triggers with empty buffer
2. Condition `buffer.empty() && !timed_out` is FALSE (timed_out is true)
3. Code continues to access buffer.back() on empty string
4. Undefined behavior reads garbage memory
5. LLM sees random characters instead of empty input

This explains why the LLM reported receiving a "one-word string" instead
of empty input - it was reading uninitialized memory.

Fix: Add check to ensure buffer is not empty before accessing buffer.back()

Changed from:
  if (buffer.back() == '\n') {

To:
  if (!buffer.empty() && buffer.back() == '\n') {

Now idle timeout correctly sends truly empty input to the model.
Added LOG_DBG to show how many tokens are in tool output before injection.
This helps diagnose KV cache allocation failures with large contexts and
flash attention when tool output is injected.

Related to investigating "decode: failed to find a memory slot" errors
when using very large contexts (e.g., 96000) with flash attention enabled.