Merged
26 changes: 26 additions & 0 deletions .github/workflows/rust.yml
@@ -37,6 +37,32 @@ jobs:
run: cargo nextest run --workspace
# Note: No doctests - clemini is a binary crate without a library target

test-integration:
name: Integration Tests
needs: check
runs-on: ubuntu-latest
# Only run on push to main or PRs from same repo (forks don't have secrets)
if: github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- uses: taiki-e/install-action@cargo-nextest
- uses: Swatinem/rust-cache@v2
with:
shared-key: integration
- name: Install mold linker
run: sudo apt-get update && sudo apt-get install -y mold
- name: Run integration tests
run: |
set -e
for test in confirmation_tests tool_output_tests semantic_integration_tests; do
echo "::group::Running $test"
cargo nextest run --test $test --run-ignored all
echo "::endgroup::"
done
env:
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

fmt:
name: Format
runs-on: ubuntu-latest
10 changes: 9 additions & 1 deletion .gitignore
@@ -17,7 +17,7 @@ Thumbs.db
.env.local
tmp/

# MCP config (local)
# MCP config (machine-specific paths)
.mcp.json

# Claude Code local settings
@@ -30,3 +30,11 @@ tmp/
*.log
error.log
output.txt

# Benchmark build artifacts
benchmark/exercises/**/.gradle/
benchmark/exercises/**/bin/
benchmark/exercises/**/node_modules/
benchmark/exercises/**/build/
benchmark/exercises/**/*.class
benchmark/exercises/**/package-lock.json
11 changes: 11 additions & 0 deletions .mcp.json.example
@@ -0,0 +1,11 @@
{
"mcpServers": {
"clemini": {
"command": "/path/to/clemini/target/release/clemini",
"args": ["--mcp-server"],
"env": {
"GEMINI_API_KEY": "${GEMINI_API_KEY}"
}
}
}
}
29 changes: 26 additions & 3 deletions CLAUDE.md
@@ -19,7 +19,8 @@ Clemini is a Gemini-powered coding CLI built with genai-rs. It's designed to be
make check # Fast type checking
make build # Debug build
make release # Release build
make test # Run tests
make test # Unit tests only (fast, no API key)
make test-all # Full suite including integration tests (requires GEMINI_API_KEY)
make clippy # Lint with warnings as errors
make fmt # Format code
make logs # Tail human-readable logs
@@ -72,7 +73,7 @@ run_interaction() UI Layer
- `McpEventHandler` (`mcp.rs`) - MCP server mode

All handlers use shared formatting functions:
- `format_tool_executing()` - Format tool executing line (`🔧 name args`)
- `format_tool_executing()` - Format tool executing line (`┌─ name args`)
- `format_tool_result()` - Format tool completion line (`└─ name duration ~tokens tok`)
- `format_tool_args()` - Format tool arguments as key=value pairs (used by format_tool_executing)
- `format_context_warning()` - Format context window warnings
@@ -113,6 +114,7 @@ Debugging: `LOUD_WIRE=1` logs all HTTP requests/responses.

## Documentation

- [docs/TOOLS.md](docs/TOOLS.md) - Tool reference, design philosophy, implementation guide
- [docs/TUI.md](docs/TUI.md) - TUI architecture (ratatui, event loop, output channels)
- [docs/TEXT_RENDERING.md](docs/TEXT_RENDERING.md) - Output formatting guidelines (colors, truncation, spacing)

@@ -141,11 +143,20 @@ Debugging: `LOUD_WIRE=1` logs all HTTP requests/responses.

Don't skip tests. If a test is flaky or legitimately broken by your change, fix the test as part of the PR.

**Integration tests** - Tests in `tests/` that require `GEMINI_API_KEY` use semantic validation:
- `confirmation_tests.rs` - Confirmation flow for destructive commands
- `tool_output_tests.rs` - Tool output events and model interpretation
- `semantic_integration_tests.rs` - Multi-turn state, error recovery, code analysis

Run locally with: `cargo test --test <name> -- --include-ignored --nocapture`

These use `validate_response_semantically()` from `tests/common/mod.rs` - a second Gemini call with structured output that judges whether responses are appropriate. This provides a middle ground between brittle string assertions and purely structural checks.
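A minimal, self-contained sketch of the judge contract (only `validate_response_semantically` comes from the codebase; the `Verdict` shape and the stub body are assumptions — the real helper makes a second Gemini call with a structured-output schema):

```rust
// Sketch only: in tests/common/mod.rs the body sends both strings to Gemini
// and parses a structured judgment; here a stub stands in for the API call.
#[derive(Debug)]
struct Verdict {
    appropriate: bool,
    reasoning: String,
}

fn validate_response_semantically(response: &str, criteria: &str) -> Verdict {
    // Stub judgment: a real judge evaluates the response against the criteria.
    let appropriate = !response.is_empty();
    Verdict {
        appropriate,
        reasoning: format!("judged against: {criteria}"),
    }
}

fn main() {
    let v = validate_response_semantically(
        "The panic is a use-after-free in the event loop.",
        "identifies the root cause of the panic",
    );
    assert!(v.appropriate);
    println!("{}", v.reasoning);
}
```

The point of the contract: tests assert on `appropriate` and surface `reasoning` in the failure message, so a rejected response explains itself.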

**Visual output changes** - Tool output formatting is centralized in `src/events.rs`:

| Change | Location |
|--------|----------|
| Tool executing format (`🔧 name...`) | `format_tool_executing()` in `events.rs` |
| Tool executing format (`┌─ name...`) | `format_tool_executing()` in `events.rs` |
| Tool result format (`└─ name...`) | `format_tool_result()` in `events.rs` |
| Tool error detail (`└─ error:...`) | `format_error_detail()` in `events.rs` |
| Tool args format (`key=value`) | `format_tool_args()` in `events.rs` |
@@ -178,6 +189,18 @@ Test visual changes by running clemini in each mode and verifying the output loo
- `TuiEventHandler` in `main.rs` (needs `AppEvent`)
- `McpEventHandler` in `mcp.rs` (needs MCP notification channel)

**Tool output via events** - Tools emit `AgentEvent::ToolOutput` for visual output, never call `log_event()` directly. This ensures correct ordering (all output flows through the event channel) and keeps tools decoupled from the UI layer. The standard `emit()` helper pattern:
```rust
fn emit(&self, output: &str) {
if let Some(tx) = &self.events_tx {
let _ = tx.try_send(AgentEvent::ToolOutput(output.to_string()));
} else {
crate::logging::log_event(output);
}
}
```
Uses `try_send` (non-blocking) to avoid stalling tools on slow consumers. The fallback to `log_event()` allows tools to work in contexts where events aren't available (e.g., direct tool tests).

### Module Responsibilities

| Module | Responsibility |
1 change: 1 addition & 0 deletions Cargo.toml
@@ -59,3 +59,4 @@ similar = "2"
[dev-dependencies]
tempfile = "3.10"
mockito = "1.2"
serial_test = "3.1"
11 changes: 9 additions & 2 deletions Makefile
@@ -1,4 +1,4 @@
.PHONY: check build release test clippy fmt logs
.PHONY: check build release test test-all clippy fmt logs

LOG_DIR = $(HOME)/.clemini/logs
LOG_FILE = $(LOG_DIR)/clemini.log.$(shell date +%Y-%m-%d)
@@ -12,8 +12,15 @@ build:
release:
cargo build --release

# Run unit tests only (fast, no API key required)
test:
cargo test
cargo test --lib
cargo test --bin clemini
cargo test --test event_ordering_tests

# Run all tests including integration tests (requires GEMINI_API_KEY)
test-all:
cargo nextest run --run-ignored all

clippy:
cargo clippy -- -D warnings
61 changes: 59 additions & 2 deletions benchmark/run.py
@@ -8,6 +8,40 @@
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_exercises_dirty():
"""Check if benchmark/exercises has uncommitted changes. Returns list of modified files."""
result = subprocess.run(
["git", "status", "--porcelain", "benchmark/exercises/"],
capture_output=True,
text=True,
)
if result.returncode != 0:
return []
# Filter to only modified tracked files (M or space+M), not untracked (??)
modified = []
for line in result.stdout.strip().split("\n"):
if line and not line.startswith("??"):
# Extract filename (after the status prefix)
modified.append(line[3:] if len(line) > 3 else line)
return [f for f in modified if f]


def reset_exercises():
"""Reset benchmark/exercises to clean state using git checkout."""
print("Resetting exercises to clean state...")
result = subprocess.run(
["git", "checkout", "--", "benchmark/exercises/"],
capture_output=True,
text=True,
)
if result.returncode != 0:
print(f"Warning: git checkout failed: {result.stderr}")
return False
print("Exercises reset successfully.")
return True


def run_clemini(prompt, cwd):
"""Call clemini via subprocess with the given prompt."""
cmd = [
@@ -117,16 +151,39 @@ def main():
parser = argparse.ArgumentParser(description="Run clemini benchmark on exercises.")
parser.add_argument("--parallel", type=int, default=2, help="Number of exercises to run in parallel.")
parser.add_argument("--time-limit", type=int, default=5, help="Time limit in minutes.")
parser.add_argument("--reset", action="store_true", help="Reset exercises to clean state before running.")
parser.add_argument("-y", "--yes", action="store_true", help="Skip confirmation prompts.")
args = parser.parse_args()

repo_root = Path(__file__).parent.parent.absolute()
os.chdir(repo_root)

base_dir = Path("benchmark/exercises")
if not base_dir.exists():
print(f"Error: {base_dir} not found. Run setup.py first.")
sys.exit(1)


# Handle reset flag
if args.reset:
reset_exercises()
else:
# Check for dirty state and warn
dirty_files = check_exercises_dirty()
if dirty_files:
print(f"\n⚠️ Warning: {len(dirty_files)} exercise file(s) have uncommitted changes:")
for f in dirty_files[:10]: # Show first 10
print(f" {f}")
if len(dirty_files) > 10:
print(f" ... and {len(dirty_files) - 10} more")
print("\nBenchmark results may be affected by previous runs.")
print("Use --reset to restore exercises to clean state.\n")

if not args.yes:
response = input("Continue anyway? [y/N] ").strip().lower()
if response not in ("y", "yes"):
print("Aborted.")
sys.exit(0)

exercises = sorted([d.name for d in base_dir.iterdir() if d.is_dir()])
random.shuffle(exercises)

10 changes: 5 additions & 5 deletions docs/TEXT_RENDERING.md
@@ -39,7 +39,7 @@ All three UI modes (Terminal, TUI, MCP) implement the `EventHandler` trait in `e

| Function | Output |
|----------|--------|
| `format_tool_executing()` | `🔧 tool_name args...` |
| `format_tool_executing()` | `┌─ tool_name args...` |
| `format_tool_result()` | `└─ tool_name 0.02s ~18 tok` |
| `format_error_detail()` | ` └─ error: message` |
| `format_tool_args()` | `key=value key2=value2` |
@@ -57,7 +57,7 @@ Uses the `colored` crate for ANSI terminal colors:
| Tool names | Cyan | `.cyan()` |
| Duration | Yellow | `.yellow()` |
| Error labels | Bright red + bold | `.bright_red().bold()` |
| Tool emoji (🔧) | Dimmed grey | `.dimmed()` |
| Tool bracket (┌─) | Dimmed grey | `.dimmed()` |
| Tool arguments | Dimmed grey | `.dimmed()` |
| Bash command/output | Dimmed grey + italic | `.dimmed().italic()` |
| Diff deletions | Red | `.red()` |
@@ -72,16 +72,16 @@ Uses the `colored` crate for ANSI terminal colors:
### Executing Line (Before Execution)

```
🔧 <tool_name> <formatted_args>
┌─ <tool_name> <formatted_args>
```

- `🔧`: Dimmed
- `┌─`: Dimmed
- `<tool_name>`: Cyan
- `<formatted_args>`: Dimmed grey, key=value pairs

Example:
```
🔧 read_file file_path="/src/main.rs"
┌─ read_file file_path="/src/main.rs"
```

### Result Line (After Execution)
29 changes: 27 additions & 2 deletions docs/TOOLS.md
@@ -222,17 +222,40 @@ Execute shell commands.
---

#### kill_shell
Kill a background bash task.
Kill a background task (bash or subagent).

**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| task_id | string | yes | Task ID from bash with `run_in_background=true` |
| task_id | string | yes | Task ID from bash or task tool |

**Returns:** `{task_id, status, success}`

---

#### task
Spawn a clemini subagent to handle delegated work.

**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| prompt | string | yes | The task/prompt for the subagent |
| background | boolean | no | Return immediately with a task_id (default: false) |

**Returns:** `{status, stdout, stderr, exit_code}` or `{task_id, status, prompt}` when `background=true`

**Limitations:**
- Subagent cannot use interactive tools (`ask_user`) - stdin is null
- Subagent gets its own sandbox based on cwd (does not inherit parent's `allowed_paths`)
- Background tasks are fire-and-forget (no output capture yet - see issue #79)

**Use cases:**
- Parallel work on independent subtasks
- Breaking down complex tasks for focused execution
- Long-running operations that don't need real-time output
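A hypothetical call, with illustrative argument values:

```json
{
  "prompt": "Run cargo clippy and summarize the warnings by category",
  "background": true
}
```

With `background=true` this returns immediately with a `task_id`, which can later be passed to `kill_shell`.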

---

### Interaction

#### ask_user
@@ -306,5 +329,7 @@ Fetch and optionally process a web page.
| Create new files | `write_file` | Only for new files or complete rewrites |
| Run builds/tests | `bash` | Shell commands with output capture |
| Long-running commands | `bash` + `run_in_background` | Don't block on slow operations |
| Delegate complex work | `task` | Spawn focused subagent for subtasks |
| Parallel subtasks | `task` + `background=true` | Multiple subagents working concurrently |
| Need user input | `ask_user` | Rather than guessing |
| Multi-step tasks | `todo_write` | Create todos FIRST, then work through them |