
Conversation

ayourtch commented Nov 9, 2025

Make sure to read the contributing guidelines before submitting a PR

claude added 24 commits November 8, 2025 11:11
This commit adds two new CLI options to the main tool:
- --dump-activations: Dumps intermediate layer activations to a GGUF file
- --load-activations: Loads and displays activations from a GGUF file

The implementation:
- Adds activation collection callback that captures tensors from various
  operations (MUL_MAT, ADD, MUL, NORM, RMS_NORM)
- Saves collected activations with metadata (version, count) to GGUF format
- Loads and displays activation tensor information from GGUF files

This is useful for debugging, analysis, and understanding model behavior
by examining intermediate activations during inference.
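
As a rough illustration of the collection callback this commit describes (the activation_store and collect_activations_cb names below are invented, not taken from the patch), a cb_eval-style callback might look roughly like this:

  #include <map>
  #include <string>
  #include <vector>

  #include "ggml.h"
  #include "ggml-backend.h"

  struct activation_store {
      std::map<std::string, std::vector<float>> data; // tensor name -> host copy
  };

  static bool collect_activations_cb(struct ggml_tensor * t, bool ask, void * user_data) {
      auto * store = static_cast<activation_store *>(user_data);

      const bool wanted =
          t->op == GGML_OP_MUL_MAT || t->op == GGML_OP_ADD ||
          t->op == GGML_OP_MUL     || t->op == GGML_OP_NORM ||
          t->op == GGML_OP_RMS_NORM;

      if (ask) {
          return wanted; // first pass: tell the scheduler which tensors to observe
      }
      if (!wanted || t->type != GGML_TYPE_F32) {
          return true;   // not interesting - keep evaluating the graph
      }

      // second pass: data is computed (possibly on a GPU backend) - copy it to host memory
      std::vector<float> host(ggml_nelements(t));
      ggml_backend_tensor_get(t, host.data(), 0, ggml_nbytes(t));
      store->data[ggml_get_name(t)] = std::move(host);

      return true;
  }
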
…sation

This commit adds two interactive commands that can be used during a chat session:
- /\/save <filename> - Triggers activation collection for the next inference pass
  and saves the collected activations to a GGUF file
- /\/load <filename> - Loads and displays activations from a GGUF file

Key features:
- Activations are collected only for one inference pass when /\/save is used
- The callback is automatically enabled in interactive mode
- Clear user feedback when activations are being collected and saved
- Helpful command information displayed when entering interactive mode

This allows users to interactively capture activations at specific points in a
conversation, making it much more flexible for debugging and analysis.

Example usage:
  > /\/save response1.gguf
  > What is the capital of France?
  [activations collected and saved after response]

  > /\/load response1.gguf
  [displays tensor information]
…d debugging

The previous implementation had a bug where after entering the /\/save or /\/load
command, the code would continue processing instead of looping back to wait for
the next user input.

Changes:
- Added explicit 'continue' after processing commands to loop back properly
- Set is_interacting=true to ensure we stay in interactive mode
- Improved user feedback messages (clearer instructions)
- Added debug logging to help diagnose activation collection issues:
  * Log when callback collects activations
  * Log activation count before saving
  * Better error messages if no activations collected

This fixes the issue where /\/save would trigger inference immediately
instead of waiting for the next user prompt.
The previous code was setting params.cb_eval AFTER the context was created
by common_init_from_params(params). This meant the context never received
the callback, so no activations were being collected.

This commit moves the callback setup to happen BEFORE context creation,
which is essential for the callback to be properly registered with the
llama context.

Changes:
- Moved callback setup before common_init_from_params() call
- Added debug logging to confirm callback is enabled in interactive mode
- Added comment explaining why the order matters

This fixes the issue where 0 activations were being collected because
the callback was never being triggered during inference.
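
In code form, the ordering fix amounts to something like the following sketch (reusing the hypothetical callback and store from the earlier snippet):

  activation_store store;

  // set the callback BEFORE the context is created, otherwise it is never registered
  params.cb_eval           = collect_activations_cb;
  params.cb_eval_user_data = &store;

  // model + context are created here and pick up cb_eval from params
  auto llama_init = common_init_from_params(params);
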
This commit adds debugging to help diagnose why the callback isn't
being triggered:

1. Set params.cb_eval_user_data explicitly to nullptr
2. Disable warmup (params.warmup = false) as warmup may interfere
   with callbacks - following the pattern from the imatrix tool
3. Add debug logging to show when callback is first invoked
4. Add debug logging when callback asks about tensors

This will help us understand if:
- The callback is being invoked at all
- The callback is seeing tensors but filtering them out
- There's an issue with the callback registration

To test with debug output, run:
  LLAMA_LOG_LEVEL=5 ./llama-cli -m model.gguf --interactive
This commit improves visibility into the activation collection process,
especially when using GPU offloading:

Changes:
1. Changed first callback log from LOG_DBG to LOG - now always visible
   to immediately show if callback is being invoked
2. Added collection counter to track how many activations are collected
3. Changed duplicate handling - instead of skipping, create unique names
   (tensor_1, tensor_2, etc.) to capture all activations
4. Added detailed logging showing if tensors are on CPU vs GPU
5. Added periodic progress logging every 10 collections
6. Improved debug output to show operation type

The code already handles GPU tensors correctly using ggml_backend_tensor_get()
to copy from GPU memory to host, following the same pattern as the imatrix tool.

This will help diagnose why activations might not be collected when using
GPU offloading.
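
A condensed sketch of the host-vs-device handling described here (the fetch_activation helper name is made up; the pattern follows the imatrix tool):

  #include <cstring>
  #include <vector>

  #include "ggml.h"
  #include "ggml-backend.h"
  #include "log.h"

  // copy one F32 activation tensor into host memory, logging where it came from
  static std::vector<float> fetch_activation(const struct ggml_tensor * t) {
      const bool is_host = t->buffer == nullptr || ggml_backend_buffer_is_host(t->buffer);
      LOG("collecting %s (op=%s, %s)\n",
          ggml_get_name(t), ggml_op_name(t->op), is_host ? "CPU" : "GPU");

      std::vector<float> host_copy(ggml_nelements(t));
      if (is_host) {
          memcpy(host_copy.data(), t->data, ggml_nbytes(t));               // already on host
      } else {
          ggml_backend_tensor_get(t, host_copy.data(), 0, ggml_nbytes(t)); // device -> host copy
      }
      return host_copy;
  }
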
ROOT CAUSE FOUND:
The callback was never being invoked because llama.cpp reuses the computation
graph on subsequent inferences for performance. When the graph is reused,
the callback is NOT set (see llama-context.cpp:778 - callback only set when
graph is rebuilt).

SOLUTION:
Set the environment variable LLAMA_GRAPH_REUSE_DISABLE=1 when activation
collection is enabled. This forces the graph to be rebuilt on every inference,
which ensures the callback gets set and invoked.

Changes:
1. Call setenv("LLAMA_GRAPH_REUSE_DISABLE", "1", 1) in both:
   - When --dump-activations is used
   - When in interactive mode (for /\/save command)

2. Added log message to inform users graph reuse is disabled

Note: Disabling graph reuse will slightly reduce performance, but this is
necessary for the callback mechanism to work. This only affects runs where
activation dumping is enabled.

This should finally fix the issue where 0 activations were being collected!
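
The workaround itself is small; a sketch is shown below (the dump_activations flag is hypothetical, the rest follows the commit text; setenv() is POSIX, so a Windows build would need _putenv_s()):

  #include <cstdlib>

  if (params.dump_activations || params.interactive) {
      // force graph rebuilds so cb_eval is re-attached on every decode
      setenv("LLAMA_GRAPH_REUSE_DISABLE", "1", 1);
      LOG_INF("activation capture: graph reuse disabled (slower, but required for cb_eval)\n");
  }
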
Replace callback-based approach with simpler state serialization:
- Remove all callback and graph reuse disabling code
- Use llama_state_get_data/set_data for complete state capture
- Save/restore: KV cache, RNG state, logits, embeddings
- No performance penalty from disabled optimizations

Features:
- CLI flags: --dump-activations, --load-activations
- Interactive: /\/save <file>, /\/load <file>
- GGUF format with metadata (version, size, type)
- Exact state restoration for conversation continuity

This makes it possible to save the full LLM state and later restore it to
exactly the same conversational state, enabling conversation checkpointing
and experimentation with different conversation branches.
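
A minimal sketch of the snapshot/restore calls involved (the GGUF wrapping and metadata are omitted, and the helper names are invented):

  #include <cstdint>
  #include <vector>

  #include "llama.h"

  static std::vector<uint8_t> snapshot_state(llama_context * ctx) {
      // KV cache + RNG state + logits + embeddings in one opaque blob
      std::vector<uint8_t> buf(llama_state_get_size(ctx));
      buf.resize(llama_state_get_data(ctx, buf.data(), buf.size()));
      return buf;
  }

  static bool restore_state(llama_context * ctx, const std::vector<uint8_t> & buf) {
      // returns the number of bytes consumed; 0 indicates failure/mismatch
      return llama_state_set_data(ctx, buf.data(), buf.size()) > 0;
  }
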
Implements interactive temperature get/set functionality:
- /\/temp - displays current temperature setting
- /\/temp <value> - changes temperature on the fly

Technical implementation:
- Add common_sampler_get_temp() and common_sampler_set_temp() to sampling API
- Set temp works by removing old temperature sampler from chain and replacing it
- Preserves dynamic temperature settings (dynatemp_range, exponent) when set
- Validates temperature values (must be >= 0.0)

This allows users to experiment with different temperature values
during a conversation without restarting the program, enabling
exploration of how temperature affects model outputs in real-time.
The previous swapping logic for repositioning the sampler in the chain
was buggy and did not actually change the temperature.

Fixed approach:
1. Remove old temperature sampler at position temp_idx
2. Collect all samplers that come after that position
3. Add new temperature sampler with new value
4. Add back all collected samplers in original order

This correctly preserves the sampler chain order while replacing
the temperature sampler, so temperature changes now actually take effect.

Users should now see dramatic differences in output randomness:
- temp 0.1: very deterministic
- temp 1.0: balanced
- temp 10.0+: increasingly random/creative
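
A sketch of that remove-and-rebuild approach on a raw sampler chain (the PR wraps this behind common_sampler_set_temp(); the free-standing helper below and its detection of the temperature sampler by name are assumptions):

  #include <string>
  #include <vector>

  #include "llama.h"

  static void set_chain_temp(llama_sampler * chain, float new_temp) {
      // 1. find the temperature sampler in the chain
      int temp_idx = -1;
      for (int i = 0; i < llama_sampler_chain_n(chain); ++i) {
          const std::string name = llama_sampler_name(llama_sampler_chain_get(chain, i));
          if (name.find("temp") != std::string::npos) { temp_idx = i; break; }
      }
      if (temp_idx < 0) return;

      // 2. pop everything after it so the chain order can be preserved
      std::vector<llama_sampler *> tail;
      while (llama_sampler_chain_n(chain) > temp_idx + 1) {
          tail.push_back(llama_sampler_chain_remove(chain, temp_idx + 1));
      }

      // 3. replace the temperature sampler itself
      llama_sampler_free(llama_sampler_chain_remove(chain, temp_idx));
      llama_sampler_chain_add(chain, llama_sampler_init_temp(new_temp));

      // 4. re-append the tail in its original order
      for (llama_sampler * s : tail) {
          llama_sampler_chain_add(chain, s);
      }
  }
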
User reports temperature changes aren't taking effect.
Added detailed logging to trace:
- Sampler chain composition
- Temperature sampler detection
- Removal and replacement process
- Final chain state

This will help identify why temperature modifications
aren't affecting model output behavior.
Implements automatic KV cache persistence for debugging workflows:

New CLI flags:
- --kv-cache-auto-save <base-name>
  Automatically saves all slot KV caches to timestamped directory
  on server shutdown: <base-name>_YYYYMMDD_HHMMSS/

- --kv-cache-auto-load <dirname>
  Automatically loads all slot KV caches from specified directory
  on server startup

Implementation:
- auto_save_kv_cache(): Saves each non-empty slot to slot_N.bin
  in timestamped directory on server shutdown

- auto_load_kv_cache(): Loads each slot from slot_N.bin files
  on server initialization

- Uses llama_state_seq_save_file/load_file for per-slot persistence
- Integrated into server lifecycle: load after init(), save in cleanup

Usage example:
  # First run - build up KV cache state
  llama-server -m model.gguf --kv-cache-auto-save my_cache

  # Server shutdown creates: my_cache_20250108_143022/

  # Second run - restore state instantly
  llama-server -m model.gguf --kv-cache-auto-load my_cache_20250108_143022

This enables fast server restarts during debugging by preserving
the complete conversation/context state across sessions.
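
A simplified sketch of the per-slot save/load calls (slot bookkeeping is reduced to "each slot is one sequence id plus its cached tokens", and the helper names are invented):

  #include <string>
  #include <vector>

  #include "llama.h"

  static void save_slot(llama_context * ctx, const std::string & dir, int slot_id,
                        const std::vector<llama_token> & cache_tokens) {
      const std::string path = dir + "/slot_" + std::to_string(slot_id) + ".bin";
      llama_state_seq_save_file(ctx, path.c_str(), slot_id,
                                cache_tokens.data(), cache_tokens.size());
  }

  static size_t load_slot(llama_context * ctx, const std::string & dir, int slot_id,
                          std::vector<llama_token> & cache_tokens_out) {
      const std::string path = dir + "/slot_" + std::to_string(slot_id) + ".bin";
      cache_tokens_out.resize(llama_n_ctx(ctx));          // worst-case capacity
      size_t n_tokens = 0;
      const size_t n_read = llama_state_seq_load_file(ctx, path.c_str(), slot_id,
                                                      cache_tokens_out.data(),
                                                      cache_tokens_out.size(), &n_tokens);
      cache_tokens_out.resize(n_read ? n_tokens : 0);
      return n_read;                                       // 0 on failure
  }
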
New endpoint: POST /save-kv-cache

Allows saving KV cache at any point during server operation,
not just on shutdown. Useful for creating checkpoints during
interactive debugging sessions.

Request body (optional):
{
  "dirname": "my_checkpoint"  // Custom directory name
}

Response:
{
  "success": true,
  "directory": "my_checkpoint",  // or timestamped if not specified
  "message": "KV cache saved successfully"
}

If dirname is not provided, a timestamped directory name is generated
automatically from the --kv-cache-auto-save base name.

Implementation:
- Refactored auto_save_kv_cache() into save_kv_cache_to_dir(dirname)
- save_kv_cache_to_dir() accepts optional custom directory name
- Returns directory name on success, empty string on failure
- New endpoint handler parses the JSON body and calls the save function
- Registered at: POST /save-kv-cache

Usage examples:
  # Save with custom name
  curl -X POST http://localhost:8080/save-kv-cache \
    -H "Content-Type: application/json" \
    -d '{"dirname": "checkpoint_before_fix"}'

  # Save with auto-generated timestamp
  curl -X POST http://localhost:8080/save-kv-cache \
    -H "Content-Type: application/json" \
    -d '{}'
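
A sketch of how the handler could be wired up with cpp-httplib and nlohmann::json, which the server already uses (svr is assumed to be the server's httplib::Server instance, and save_kv_cache_to_dir() is the refactored helper described above, returning "" on failure):

  svr->Post("/save-kv-cache", [&](const httplib::Request & req, httplib::Response & res) {
      std::string dirname;
      if (!req.body.empty()) {
          const json body = json::parse(req.body, nullptr, /*allow_exceptions=*/false);
          if (body.is_object() && body.contains("dirname")) {
              dirname = body.at("dirname").get<std::string>();
          }
      }

      const std::string saved_dir = save_kv_cache_to_dir(dirname); // "" -> timestamped name handled inside
      if (saved_dir.empty()) {
          res.status = 500;
          res.set_content(json{{"success", false}, {"message", "failed to save KV cache"}}.dump(),
                          "application/json");
          return;
      }
      res.set_content(json{{"success", true},
                           {"directory", saved_dir},
                           {"message", "KV cache saved successfully"}}.dump(),
                      "application/json");
  });
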
Implements a production-ready activation capture system that allows real-time
streaming of intermediate layer activations to disk for analysis:

- Queue-based async I/O with background writer thread to avoid blocking inference
- GPU tensor support via automatic ggml_backend_tensor_get() transfers
- Flexible filtering by regex patterns and layer ranges
- Binary file format (LLMACT01) with timestamped metadata entries
- Size limits to prevent unbounded disk usage
- HTTP endpoints:
  * POST /activations/start - Begin capture with filters and limits
  * POST /activations/stop - Stop capture and finalize output
  * GET /activations/status - Query current capture statistics

Implementation details:
- Callback set via params.cb_eval at model initialization
- Global pointer g_activation_capture enables thread-safe dynamic control
- Producer-consumer pattern with condition variables for queue management
- Atomic counters for bytes_written and entries_captured statistics

This enables debugging and analysis workflows like:
- Comparing activations between model versions
- Identifying problematic layers causing inference issues
- Analyzing attention patterns and intermediate representations
- Debugging quantization effects on specific layers
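
A condensed sketch of the producer-consumer design described above (struct and member names are illustrative; the real implementation also writes the LLMACT01 header, metadata entries, and size limits):

  #include <atomic>
  #include <condition_variable>
  #include <cstdint>
  #include <cstdio>
  #include <mutex>
  #include <queue>
  #include <string>
  #include <thread>
  #include <vector>

  struct activation_capture {
      std::queue<std::vector<uint8_t>> queue;   // serialized entries waiting for disk
      std::mutex                       mtx;
      std::condition_variable          cv;
      std::atomic<bool>                running{false};
      std::atomic<size_t>              bytes_written{0};
      std::atomic<size_t>              entries_captured{0};
      std::thread                      writer;

      void start(const std::string & path) {
          running = true;
          writer = std::thread([this, path]() {
              FILE * f = fopen(path.c_str(), "wb");
              if (!f) { running = false; return; }
              std::unique_lock<std::mutex> lock(mtx);
              while (running || !queue.empty()) {
                  cv.wait(lock, [this] { return !queue.empty() || !running; });
                  while (!queue.empty()) {
                      auto entry = std::move(queue.front());
                      queue.pop();
                      lock.unlock();                       // write without holding the lock
                      fwrite(entry.data(), 1, entry.size(), f);
                      bytes_written += entry.size();
                      lock.lock();
                  }
              }
              fclose(f);
          });
      }

      // called from the cb_eval callback (producer side)
      void push(std::vector<uint8_t> entry) {
          {
              std::lock_guard<std::mutex> lock(mtx);
              queue.push(std::move(entry));
          }
          entries_captured++;
          cv.notify_one();
      }

      void stop() {
          running = false;
          cv.notify_one();
          if (writer.joinable()) writer.join();
      }
  };
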
Implements a flexible tool-calling system that allows LLMs to invoke external
executables from a "tools" directory:

Features:
- LLM can request tool list by outputting <tools-help/>
  * Automatically scans "tools" directory for executables
  * Runs each with "help" parameter to collect usage info
  * Injects concatenated help text back into conversation

- LLM can execute tools by outputting <tool-launch>tool-name args</tool-launch>
  * Executes the specified tool from tools/ directory
  * Captures stdout/stderr and exit code
  * Injects output back into conversation for LLM to process

Implementation:
- Platform-specific: Full support on Unix/macOS, stub on Windows
- Uses popen() for command execution with output capture
- Alphabetically sorted tool listing for consistency
- Robust parsing of tool-launch tags with argument extraction
- Checks recent output buffer (128 tokens) for tag detection

Example tools directory structure:
  tools/
    calculator     (executable)
    web_search     (executable)
    file_reader    (executable)

This enables LLMs to:
- Access external data sources
- Perform calculations
- Query databases or APIs
- Interact with system utilities
- Extend capabilities without retraining

Security note: Only executables in the "tools" directory are accessible,
providing a sandboxed environment for tool execution.
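
A sketch of the Unix-side tool execution described above (argument handling and the Windows stub are omitted; the run_tool helper name is made up):

  #include <array>
  #include <cstdio>
  #include <string>

  static std::string run_tool(const std::string & tool_name, const std::string & args, int & exit_code) {
      // only names under the fixed "tools" directory are ever run
      const std::string cmd = "./tools/" + tool_name + " " + args + " 2>&1"; // capture stderr too

      FILE * pipe = popen(cmd.c_str(), "r");
      if (!pipe) {
          exit_code = -1;
          return "failed to launch tool: " + tool_name;
      }

      std::string output;
      std::array<char, 4096> buf;
      while (fgets(buf.data(), buf.size(), pipe) != nullptr) {
          output += buf.data();
      }
      exit_code = pclose(pipe);   // raw wait status; WEXITSTATUS() would decode it
      return output;
  }
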
Changed tool detection logic to use mutually exclusive checks (if/else)
instead of independent checks. This prevents tool help text containing
<tool-launch> examples from being accidentally executed.

Previously:
- Both <tools-help/> and <tool-launch> were checked in the same iteration
- If help text contained example usage like "<tool-launch>calc 2+2</tool-launch>",
  it would be detected and executed immediately after being injected

Now:
- Only one tool action is processed per iteration
- If <tools-help/> is detected, skip <tool-launch> check until next iteration
- Help examples remain as documentation without triggering execution
Fixes two critical issues with tool calling:

1. Think tag filtering:
   - Tool tags inside <think>...</think> are now ignored
   - Added is_inside_think_tags() to check if a position is within think blocks
   - Prevents accidental tool execution during model reasoning
   - Recursively searches for tool-launch tags outside think blocks

2. Duplicate execution prevention:
   - Tracks last executed tool signature (tool_name|args)
   - Skips re-execution if same tool call detected in buffer
   - Resets signature on new user input to allow reuse in conversation
   - Prevents multiple executions when tag remains in 128-token lookback window

Example scenarios now handled correctly:

Scenario 1 - Think tags:
  Model: <think>Maybe I should use <tool-launch>calc 2+2</tool-launch></think>
  Result: Tool NOT executed (inside think block)

  Model: Let me calculate this. <tool-launch>calc 2+2</tool-launch>
  Result: Tool executed (outside think block)

Scenario 2 - Duplicates:
  Model generates: <tool-launch>search foo</tool-launch>
  Iteration 1: Tool executed, output injected
  Iteration 2: Same tag still in buffer -> skipped
  User types new input
  Model generates: <tool-launch>search foo</tool-launch>
  Result: Tool executed again (signature reset on user input)

This ensures tools are only executed when the model explicitly intends to
use them outside of reasoning blocks, and each tool call executes exactly once.
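
A sketch of the two guards described above, operating on the recent output buffer (helper names other than is_inside_think_tags() are invented):

  #include <string>

  // a tag starting at `pos` is ignored if the last <think> before it has no
  // matching </think> in between
  static bool is_inside_think_tags(const std::string & buf, size_t pos) {
      const size_t open  = buf.rfind("<think>",  pos);
      const size_t close = buf.rfind("</think>", pos);
      return open != std::string::npos && (close == std::string::npos || close < open);
  }

  // duplicate-execution guard: remember the last executed call and skip repeats
  // until the signature is reset on new user input (or an idle auto-submit)
  static std::string last_tool_signature;

  static bool should_execute(const std::string & tool_name, const std::string & args) {
      const std::string sig = tool_name + "|" + args;
      if (sig == last_tool_signature) {
          return false;            // same call still sitting in the lookback window
      }
      last_tool_signature = sig;
      return true;
  }
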
Implements a feature to auto-submit empty input after a period of user
inactivity, allowing the agent to continue thinking without user interaction.

New CLI flag:
  --idle-action-interval N
    Auto-submit empty input after N minutes of idle time (default: 0 = disabled)

How it works:
- Tracks last activity timestamp (updated on any user input)
- Before waiting for user input, checks if idle interval has elapsed
- If idle timeout reached, automatically submits empty input
- Resets timer after auto-submission for next iteration
- Any keystroke/input from user resets the idle timer

Use cases:
- Agent continues reasoning/thinking during long idle periods
- Useful for autonomous workflows where agent should self-prompt
- Allows agent to work through complex problems without waiting

Example usage:
  llama-cli -m model.gguf --idle-action-interval 5
  (Agent will auto-submit empty input after 5 minutes of no user activity)

Implementation notes:
- Activity time tracked globally via g_last_activity_time
- Idle check happens when interactive mode waits for input
- Auto-submitted input is distinguishable from Ctrl+D (EOF)
- Console readline is bypassed when idle timeout triggers
- Timer resets on both manual and automatic input submission

Changes:
- common/common.h: Added idle_action_interval parameter
- common/arg.cpp: Added --idle-action-interval argument parser
- tools/main/main.cpp: Implemented idle timeout logic and tracking
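
A sketch of the timer bookkeeping described above (g_last_activity_time and the interval in minutes follow the commit text; the idle_timeout_reached helper is a made-up name):

  #include <ctime>

  static time_t g_last_activity_time = time(nullptr);

  static void update_activity_time() {
      g_last_activity_time = time(nullptr);
  }

  static bool idle_timeout_reached(int idle_action_interval_min) {
      if (idle_action_interval_min <= 0) {
          return false;                                   // feature disabled
      }
      const double idle_s = difftime(time(nullptr), g_last_activity_time);
      return idle_s >= idle_action_interval_min * 60.0;
  }
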
Previously, the idle timer would continue running while the agent was
generating output, which could cause immediate auto-submission if the
agent took longer than the idle interval to respond.

Now the timer resets when transitioning to interactive mode (when
is_interacting becomes true), ensuring it only measures time spent
waiting for user input, not time spent generating.

Behavior before fix:
1. User enters input → timer updates (t=0)
2. Agent generates for 10 minutes → timer running (t=10m)
3. Agent finishes → idle check
4. If idle_interval=5m → triggers immediately (10m > 5m)

Behavior after fix:
1. User enters input → timer updates (t=0)
2. Agent generates for 10 minutes → timer running but will be reset
3. Agent finishes, enters interactive mode → timer resets (t=0)
4. User idle for 5 minutes → idle timeout triggers correctly (5m >= 5m)

The fix adds an update_activity_time() call at line 1277, when entering
the "waiting for user input" state, right before displaying the prompt.
Previous implementation checked for timeout but then blocked indefinitely
on readline. Now uses select() (Unix) / WaitForSingleObject (Windows) to
check if input is available before blocking.

Changes:
- common/console.h: Added readline_with_timeout() function
- common/console.cpp: Implemented timeout using select()/WaitForSingleObject
- tools/main/main.cpp: Use readline_with_timeout with calculated timeout

How it works:
1. Calculate remaining timeout based on idle_action_interval and elapsed time
2. Call readline_with_timeout() with remaining seconds
3. If timeout occurs, auto-submit empty input
4. If user types anything, reset timer and disable timeout for continuation lines

Unix implementation:
- Uses select() on STDIN_FILENO with timeout
- Returns immediately if input available or timeout elapsed

Windows implementation:
- Uses WaitForSingleObject() on stdin handle with timeout
- Returns immediately if input available or timeout elapsed

This fixes the issue where idle timeout would never trigger because
readline() was blocking indefinitely waiting for input.
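
A Unix-side sketch of the approach described above: wait on stdin with select() for up to the remaining timeout, then do a normal blocking read only if input is actually available (the WaitForSingleObject variant and the console state handling are omitted; this is not the exact console.cpp code):

  #include <iostream>
  #include <string>

  #include <sys/select.h>
  #include <unistd.h>

  // returns false when the timeout elapsed with no input (caller auto-submits "")
  static bool readline_with_timeout(std::string & line, int timeout_s) {
      fd_set fds;
      FD_ZERO(&fds);
      FD_SET(STDIN_FILENO, &fds);

      timeval tv;
      tv.tv_sec  = timeout_s;
      tv.tv_usec = 0;

      const int ready = select(STDIN_FILENO + 1, &fds, nullptr, nullptr,
                               timeout_s > 0 ? &tv : nullptr);  // nullptr = block forever
      if (ready <= 0) {
          line.clear();
          return false;   // timed out (or select error) - treat as empty input
      }

      // input is available - this read will not block
      return static_cast<bool>(std::getline(std::cin, line));
  }
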
… thinking

Previously, the tool deduplication signature was only reset on explicit user
input, preventing the LLM from reusing tools during idle-triggered thinking sessions.

Now the signature is also reset when idle timeout triggers auto-submission,
treating it as a new conversational turn where tools can be used again.

Behavior before:
1. LLM uses <tool-launch>calculator 5+5</tool-launch> → executes
2. User idles, timeout triggers → empty input submitted
3. LLM tries <tool-launch>calculator 10+10</tool-launch> → blocked (duplicate)

Behavior after:
1. LLM uses <tool-launch>calculator 5+5</tool-launch> → executes
2. User idles, timeout triggers → signature reset, empty input submitted
3. LLM tries <tool-launch>calculator 10+10</tool-launch> → executes (new turn)

This allows the agent to fully utilize tools during autonomous thinking sessions
triggered by the --idle-action-interval feature.
Adds an interactive command to view and change the idle action interval
without restarting llama-cli.

Usage:
  /\/timeout           - Show current timeout and disable if enabled
  /\/timeout <minutes> - Set idle timeout to N minutes
  /\/timeout 0         - Disable idle timeout

Behavior:
- /\/timeout with no args displays current setting and disables if enabled
- /\/timeout N sets the timeout to N minutes (0 = disabled)
- When setting a new non-zero timeout, timer resets immediately
- Complements --idle-action-interval CLI flag with runtime control

Example session:
  > /\/timeout
  Idle timeout is currently disabled (0 minutes)

  > /\/timeout 5
  Changing idle timeout from 0 to 5 minutes
  Idle timeout set to 5 minutes

  > /\/timeout
  Current idle timeout: 5 minutes
  Disabling idle timeout

This allows users to dynamically enable/disable autonomous agent thinking
during a session based on their workflow needs.
When the idle timeout triggered with no user input, the buffer was empty, but
the code attempted to access buffer.back(), which is undefined behavior.

The issue occurred because:
1. Timeout triggers with empty buffer
2. Condition `buffer.empty() && !timed_out` is FALSE (timed_out is true)
3. Code continues to access buffer.back() on empty string
4. Undefined behavior reads garbage memory
5. LLM sees random characters instead of empty input

This explains why the LLM reported receiving a "one-word string" instead
of empty input - it was reading uninitialized memory.

Fix: Add check to ensure buffer is not empty before accessing buffer.back()

Changed from:
  if (buffer.back() == '\n') {

To:
  if (!buffer.empty() && buffer.back() == '\n') {

Now idle timeout correctly sends truly empty input to the model.
Added LOG_DBG to show how many tokens are in tool output before injection.
This helps diagnose KV cache allocation failures with large contexts and
flash attention when tool output is injected.

Related to investigating "decode: failed to find a memory slot" errors
when using very large contexts (e.g., 96000) with flash attention enabled.