Taskgrind runs repeated AI coding sessions against any repo that keeps its queue
in TASKS.md, stopping when the deadline, queue state, or stall guard says the
run is done. Use taskgrind --preflight to verify the backend and repo before a
long run, then steer later sessions with repo-local prompt or model overrides
instead of restarting the whole grind.
Sessions should exit before context fills; context exhaustion can crash the process and lose uncommitted work.
Autonomous multi-session grind — runs sequential AI coding sessions until a deadline. Each session starts with full context. State lives in TASKS.md + git, so sessions pick up seamlessly. Sessions still need to exit before the model context fills up; a context-exhausted crash can drop any uncommitted work from that session.
Taskgrind ships built-in backends for Devin, Claude Code, and Codex, and it works with any repo that uses the tasks.md spec for task management.
For local tests and repo audit helpers, keep DVB_GRIND_CMD to a single executable path. If you need a compound shell command, wrap it in a helper script first so preflight and session launch can validate it correctly.
Requires macOS or Linux (or WSL on Windows).
You need at least one AI coding backend installed:
| Backend | Install |
|---|---|
| Devin CLI | curl -fsSL https://cli.devin.ai/install.sh \| sh |
| Claude Code | npm install -g @anthropic-ai/claude-code |
| Codex | npm install -g @openai/codex |
Taskgrind defaults to Devin. Use --backend claude-code or --backend codex to switch.
Use taskgrind --preflight ~/apps/myrepo after installing a backend. The same
checks run before a real grind starts, so this is the fastest way to confirm the
binary, model, and network assumptions for the backend you chose.
| Backend | Binary taskgrind looks for | Model validation before session 1 | Most actionable setup failures |
|---|---|---|---|
| devin | devin from PATH, or TG_DEVIN_PATH if you override it | Validates the requested model by running devin --model "$TG_MODEL" --help during preflight | Backend binary not found (devin) means the CLI is missing or TG_DEVIN_PATH points at the wrong file. Model rejected by devin before starting means the model string is wrong for your Devin install. If the startup probe says the binary is a stub or broken after --version, reinstall or roll back the Devin CLI before retrying. |
| claude-code | claude from PATH | Validates the requested model by running claude --model "$TG_MODEL" --help during preflight | Backend binary not found (claude-code) usually means @anthropic-ai/claude-code is not installed globally or claude is not on PATH. Model rejected by claude-code before starting means the selected Claude model is unavailable to that install or account. |
| codex | codex from PATH | Validates the requested model by running codex --model "$TG_MODEL" --help during preflight | Backend binary not found (codex) means the Codex CLI is missing from PATH. If you keep the default Anthropic-flavored model while using --backend codex, taskgrind warns before launch because Codex expects an OpenAI model such as o3 or gpt-5.4. A later Model rejected by codex before starting failure means the chosen OpenAI model name is not accepted by your local Codex install. |
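The binary lookup in the table can be sketched as a small shell helper. This is a minimal illustration, not taskgrind's own preflight code: the function names `backend_binary` and `check_backend` and the `found:` / `missing:` messages are hypothetical, and the check only tests whether the binary resolves on PATH without running any model validation.

```shell
# Hypothetical helper: map a backend name to the binary named in the table above.
backend_binary() {
  case "$1" in
    devin)       printf '%s\n' "${TG_DEVIN_PATH:-devin}" ;;  # override wins when set
    claude-code) printf '%s\n' "claude" ;;
    codex)       printf '%s\n' "codex" ;;
    *)           return 1 ;;                                  # unknown backend
  esac
}

# Hypothetical check: report whether the backend binary resolves, without launching it.
check_backend() {
  bin=$(backend_binary "$1") || { echo "unknown backend: $1"; return 2; }
  if command -v "$bin" >/dev/null 2>&1; then
    echo "found: $bin"
  else
    echo "missing: $bin"
    return 1
  fi
}
```

Calling `check_backend codex` before a long run gives a quick local signal, but taskgrind --preflight remains the authoritative check because it also validates the model and network.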
Practical examples:
taskgrind --preflight ~/apps/myrepo
taskgrind --preflight --backend claude-code --model claude-sonnet-4.6 ~/apps/myrepo
taskgrind --preflight --backend codex --model o3 ~/apps/myrepo
brew install cbrwizard/tap/taskgrind
# One-liner
curl -fsSL https://raw.githubusercontent.com/cbrwizard/taskgrind/main/install.sh | sh
# Or clone manually
git clone https://github.com/cbrwizard/taskgrind.git ~/apps/taskgrind
# Custom install directory
TASKGRIND_INSTALL_DIR=~/tools/taskgrind sh -c "$(curl -fsSL https://raw.githubusercontent.com/cbrwizard/taskgrind/main/install.sh)"
# Add to PATH (add to your shell rc)
export PATH="$HOME/apps/taskgrind/bin:$PATH"
To update: brew upgrade taskgrind (Homebrew) or cd ~/apps/taskgrind && git pull --rebase (manual).
Contributor audit shortcut: run make audit to reproduce the local repo-audit pass without any network-only dependencies. The pass runs an actionable scan for real task markers, shellcheck, and the core docs review queue, which covers README.md, CONTRIBUTING.md, SECURITY.md, AGENTS.md, Agentfile.yaml, docs/architecture.md, docs/resume-state.md, docs/user-stories.md, man/taskgrind.1, and the repo-local audit skills .devin/skills/standing-audit-gap-loop/SKILL.md and .devin/skills/grind-log-analyze/SKILL.md.
taskgrind # 10h grind (default), current dir
taskgrind 10 # 10h grind
taskgrind ~/apps/myrepo 10 # 10h grind in specific repo
taskgrind --model claude-opus-4-7-max 8 # use specific model
taskgrind --model "gpt-5.4 XHigh thinking fast" 8 # quote multi-word model names
taskgrind --skill pipeline-ops ~/apps/bosun 10 # custom installed skill
taskgrind --prompt "focus on test coverage" 8 # focus prompt
taskgrind --backend claude-code 8 # use Claude Code backend
taskgrind --dry-run 8 ~/apps/myrepo # print config without running
taskgrind --preflight ~/apps/myrepo # run health checks only
taskgrind --resume ~/apps/myrepo # resume an interrupted grind
taskgrind --help / -h # show usage and environment variables
taskgrind --version / -V # print version (commit hash + date)
TG_MODEL=sonnet taskgrind 8 # pick a model alias without changing shell history
TG_BACKEND=codex taskgrind 8 # make a wrapper or terminal default use Codex
TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8 # allow three concurrent grinds per repo
TG_STATUS_FILE=/tmp/taskgrind-status.json taskgrind ~/apps/myrepo 8 # write machine-readable status snapshots
Arguments can appear in any order. Hours is any bare integer 1-24.
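One way to picture order-independent arguments is to classify each bare argument by shape. The sketch below is an assumption about how such a classifier could look, not taskgrind's parser: `is_valid_hours` and `classify_arg` are hypothetical names, and the rule that a valid hours value takes precedence over a directory of the same name is also an assumption.

```shell
# Hypothetical check: succeed only for a bare integer from 1 to 24.
is_valid_hours() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  # The length guard keeps very long digit strings away from the numeric tests.
  [ "${#1}" -le 2 ] && [ "$1" -ge 1 ] && [ "$1" -le 24 ]
}

# Hypothetical classifier: decide what a positional argument most likely is.
classify_arg() {
  if is_valid_hours "$1"; then
    echo "hours"
  elif [ -d "$1" ]; then
    echo "repo"
  else
    echo "other"
  fi
}
```

With this shape, `taskgrind ~/apps/myrepo 10` and `taskgrind 10 ~/apps/myrepo` would classify to the same pair of values regardless of order.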
Env vars are especially useful when you wrap taskgrind in launchd, cron,
shell aliases, or a small supervisor script. Use flags when you want a
one-off override in your shell history; use TG_BACKEND or TG_MODEL when
you want restarts and helper scripts to inherit the same defaults without
retyping them on every launch.
--skill accepts any installed skill. Use repo-local skills such as
standing-audit-gap-loop or globally installed skills such as pipeline-ops
when you want a lane other than the default next-task workflow.
- Launches an AI session with the next-task skill (configurable via --skill, backend via --backend)
- Session picks a task from TASKS.md, implements it, commits, and exits
- Between sessions: cooldown, optional git sync (every N sessions)
- Exits when: queue empty, all remaining tasks blocked, deadline reached, or stall detected
That session boundary is also the context-budget guard: keep prompts, plans, and scope small enough that each agent run can finish and commit before its context window fills. If a session crashes from context exhaustion, taskgrind can resume from git and TASKS.md, but any uncommitted edits from the crashed run are gone.
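The exit conditions listed above can be sketched as one decision function. This is a simplified illustration under stated assumptions: `stop_reason` is a hypothetical helper, the order of the checks is a guess, the `stalled` label is invented for the example, and the real tool runs an empty-queue sweep and wait before giving up on an empty queue.

```shell
# Hypothetical decision helper: print a stop reason, or "continue".
# Arguments: open tasks, blocked tasks, current epoch, deadline epoch,
# consecutive zero-ship sessions.
stop_reason() {
  open=$1; blocked=$2; now=$3; deadline=$4; zero_ship=$5
  max_zero=${TG_MAX_ZERO_SHIP:-50}   # documented default for the stall guard
  if [ "$now" -ge "$deadline" ]; then
    echo "deadline_expired"
  elif [ "$open" -eq 0 ]; then
    echo "queue_empty"
  elif [ "$open" -eq "$blocked" ]; then
    echo "all_tasks_blocked"
  elif [ "$zero_ship" -ge "$max_zero" ]; then
    echo "stalled"                   # label assumed for illustration
  else
    echo "continue"
  fi
}
```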
If a run stops because of a reboot, terminal loss, or a recoverable backend
failure, taskgrind --resume <repo> restores the saved deadline, counters,
backend, skill, startup prompt baseline, and startup model baseline for that
same grind. It deliberately does not resurrect uncommitted edits from the
interrupted session, so treat resume as "continue from the last clean commit,"
not "recover everything that was in memory." See docs/resume-state.md for
the exact validation contract and docs/user-stories.md for an operator-facing
resume example.
Taskgrind reads TASKS.md following the tasks.md spec. Tasks use checkbox format under priority headings:
# Tasks
## P0
- [ ] Fix critical bug in auth flow
**ID**: fix-auth-bug
**Tags**: bug, auth
**Details**: The OAuth callback fails when...
**Files**: `bin/taskgrind`, `tests/preflight.bats`
**Acceptance**: Users can complete the OAuth callback without a retry loop.
## P1
- [ ] Add retry logic to API calls
**ID**: add-api-retry
**Tags**: reliability, api
**Details**: Retries should cover transient 502/503 responses only.
**Files**: `bin/taskgrind`, `tests/network.bats`
**Acceptance**: Transient upstream failures retry with backoff and permanent failures still exit fast.
**Blocked by**: backend-rate-limit-policy
Use **Blocked by** only when another task or external dependency truly prevents progress. Completed tasks are removed (not checked off). History lives in git log. See the tasks.md spec for the full format.
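To see how a queue in this format could be inspected, the sketch below counts open tasks and tasks carrying a **Blocked by** line. It is an approximation for illustration only: `queue_counts` and `queue_state` are hypothetical helpers, the labels they print are made up, and the real parser may treat edge cases differently.

```shell
# Hypothetical helper: print "<open> <blocked>" for a tasks.md-style file.
queue_counts() {
  awk '
    /^- \[ \]/ { total++; in_task = 1; counted = 0; next }  # a new open task starts
    /^#/       { in_task = 0 }                              # a heading ends the task
    in_task && !counted && /\*\*Blocked by\*\*:/ { blocked++; counted = 1 }
    END { printf "%d %d\n", total, blocked }
  ' "$1"
}

# Hypothetical summary: classify the queue from the two counts.
queue_state() {
  set -- $(queue_counts "$1")
  if [ "$1" -eq 0 ]; then
    echo "empty"
  elif [ "$1" -eq "$2" ]; then
    echo "all_blocked"
  else
    echo "workable"
  fi
}
```

Running `queue_state TASKS.md` before a long grind is a cheap way to spot a queue that would stop the run immediately.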
- Multi-backend support: works with Devin, Claude Code, and Codex via --backend
- Model selection: --model claude-opus-4-7-max or TG_MODEL=claude-opus-4-7-max to use any model the backend supports; quote multi-word model names such as --model "gpt-5.4 XHigh thinking fast"; short aliases like opus and sonnet resolve to the current preferred model IDs
- Live model switching: create/edit .taskgrind-model in the repo while running; changes take effect at the next session, including short alias resolution. Delete the file to revert to the startup model. Files larger than 1 KB are ignored with a warning.
- Live prompt injection: create/edit .taskgrind-prompt in the repo while running; changes take effect at the next session. Files larger than 10 KB are ignored with a warning.
- Preflight checks: validates the backend, network, repo, disk, queue, and optional watchdog setup before launch, plus active slot reporting. network-watchdog is optional; if missing, taskgrind falls back to curl for connectivity checks.
- Self-copy protection: copies itself to $TMPDIR before running, survives script edits mid-grind
- Slot-based per-repo locking: TG_MAX_INSTANCES allows multiple concurrent grinds on the same repo; slot 0 owns between-session git sync, higher slots get conflict-avoidance prompt guidance
- Blocked-queue detection: exits early when all remaining tasks have **Blocked by**: metadata
- Caffeinate integration: prevents system sleep on macOS (caffeinate) and Linux (systemd-inhibit)
- Git sync with stash/rebase: between-session sync stashes dirty work, auto-detects the repo default branch from origin/HEAD, remote HEAD probes, upstream tracking, or local branch fallbacks, then rebases there and cleans merged branches; tests can force the branch with DVB_DEFAULT_BRANCH. If stash creation fails, taskgrind logs the original git error and skips stash pop; if stash pop fails after a successful stash, it leaves the stash intact for manual recovery. When a rebase conflict only touches TASKS.md, taskgrind now auto-resolves it by keeping the local queue edit so queue churn does not leave the repo stuck mid-rebase.
- Empty-queue sweep: when TASKS.md is empty, launches a sweep session to find work, then waits for external task injection before exiting
- Network resilience: pauses on network loss, extends deadline on recovery
- Stall detection: bails after consecutive zero-ship sessions (configurable via TG_MAX_ZERO_SHIP)
- Per-task retry cap: skips tasks attempted 3+ times without shipping
- Fast-failure backoff: linear backoff with cap when sessions crash quickly
- Ship-rate tracking: logs cumulative effectiveness in grind_done summary, including inferred shipped work when a session removes a completed task but concurrent queue churn keeps the raw task count flat
- Productive timeout warning: detects when timeout kills sessions that were shipping
- Unique log names: includes repo basename + PID to prevent collisions
- External injection detection: logs when other processes add tasks mid-run
- Graceful shutdown: SIGINT/SIGTERM waits for running session, pushes commits, ignores duplicate shutdown signals, then exits
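The fast-failure backoff in the list above is described as linear with a cap. A minimal sketch of that idea follows, assuming the delay is the failure count multiplied by TG_BACKOFF_BASE and clamped at TG_BACKOFF_MAX; the exact formula taskgrind uses is not stated here, and `backoff_seconds` is a hypothetical name.

```shell
# Hypothetical helper: linear backoff with a cap.
# Argument: number of consecutive fast failures so far.
backoff_seconds() {
  base=${TG_BACKOFF_BASE:-15}    # documented default base
  max=${TG_BACKOFF_MAX:-120}     # documented default cap
  delay=$(( $1 * base ))         # assumed formula: failures * base
  if [ "$delay" -gt "$max" ]; then
    delay=$max
  fi
  echo "$delay"
}
```

With the defaults, the third consecutive fast failure would wait 45 seconds under this assumption, and every failure from the eighth onward would wait the 120-second cap.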
Taskgrind runs AI backends with unrestricted permissions (--permission-mode dangerous for Devin, --dangerously-skip-permissions for Claude Code). This is required because sessions need full filesystem and network access to implement tasks autonomously.
Before deploying, ensure:
- You trust the AI backend and the tasks in TASKS.md
- The repo does not contain sensitive credentials that the AI should not access
- You review the TASKS.md queue before starting a long grind
TG_ is the canonical prefix. DVB_ is supported as a backward-compatible alias for all variables.
| Variable | Default | Description |
|---|---|---|
| TG_BACKEND | devin | AI backend: devin, claude-code, codex |
| TG_MODEL | claude-opus-4-7-max | AI model (set to an OpenAI model when using --backend codex) |
| TG_SKILL | next-task | Skill to run each session |
| TG_PROMPT | (none) | Focus prompt for every session |
| TG_COOL | 5 | Seconds between sessions |
| TG_MAX_SESSION | 3600 | Max seconds per session |
| TG_MIN_SESSION | 30 | Fast-failure threshold in seconds |
| TG_MAX_FAST | 20 | Max consecutive fast failures before bail |
| TG_MAX_ZERO_SHIP | 50 | Consecutive zero-ship sessions before bail |
| TG_BACKOFF_BASE | 15 | Base seconds for fast-failure backoff |
| TG_BACKOFF_MAX | 120 | Cap for fast-failure backoff in seconds |
| TG_NET_WAIT | 30 | Network polling interval in seconds |
| TG_NET_MAX_WAIT | 14400 | Max time to wait for network recovery (4h) |
| TG_NET_RETRIES | 3 | Network check retry attempts before declaring down |
| TG_NET_RETRY_DELAY | 2 | Seconds between network check retries |
| TG_NET_CHECK_URL | https://connectivitycheck.gstatic.com/generate_204 | Override the fallback curl connectivity URL when network-watchdog is unavailable |
| TG_GIT_SYNC_TIMEOUT | 30 | Max seconds for between-session git sync |
| TG_SYNC_INTERVAL | 5 | Git sync every N sessions (0=every) |
| TG_EMPTY_QUEUE_WAIT | 600 | Seconds to wait after an empty sweep before giving up |
| TG_EARLY_EXIT_ON_STALL | 0 | Exit on low throughput (1=enabled) |
| TG_MAX_INSTANCES | 2 | Max concurrent instances per repo |
| TG_DEVIN_PATH | auto | Override devin binary path |
| TG_LOG | auto | Override log file path |
| TG_STATUS_FILE | (disabled) | Write machine-readable runtime status JSON to this path |
| TG_NOTIFY | 1 | Desktop notification on completion |
| TG_SHUTDOWN_GRACE | 120 | Seconds to wait for current session on exit |
| TG_SESSION_GRACE | 15 | Seconds to wait after session SIGINT before SIGTERM |
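The prefix aliasing can be illustrated with a small resolver. This is a sketch under assumptions: `tg_setting` is a hypothetical helper, and the precedence shown (TG_ first, then DVB_, then the default) is a guess, since the text above only says that both prefixes are accepted.

```shell
# Hypothetical resolver: prefer TG_<NAME>, fall back to DVB_<NAME>, then a default.
# NAME must be a plain identifier suffix such as COOL or MAX_SESSION.
tg_setting() {
  name=$1
  default=$2
  # Build the nested expansion ${TG_NAME:-${DVB_NAME:-$default}} for this name.
  eval "value=\${TG_$name:-\${DVB_$name:-\$default}}"
  printf '%s\n' "$value"
}
```

A wrapper script could call `tg_setting COOL 5` to display the effective cooldown before launching a grind.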
# Use the log path shown in the startup banner, or:
tail -f "${TMPDIR:-/tmp}"/taskgrind-*.log # watch live progress
cat "${TMPDIR:-/tmp}"/taskgrind-*.log # review completed sessions
Each session logs: start time, remaining minutes, task count, exit code, duration, and shipped count. When a session removes a completed task but concurrent additions, rollover, or non-local queue churn hide that work from the raw before/after task count, taskgrind logs both productive_zero_ship and shipped_inferred so operators can see why the session still counted as shipped. The grind_done summary includes ship rate, remaining tasks, and average session duration.
For machine-readable monitoring, set TG_STATUS_FILE to a JSON file path:
TG_STATUS_FILE=/tmp/taskgrind-status.json taskgrind ~/apps/myrepo 8
cat /tmp/taskgrind-status.json
The status file updates atomically on startup, before and after each session, during empty-queue sweeps and wait windows, during network waits, around git-sync decisions, and on final completion or failure. It includes the repo, process ID, slot, backend, skill, model, current session, remaining minutes, current phase, and the most recent session result.
Supervisor example:
#!/bin/sh
# Read the same path that TG_STATUS_FILE pointed taskgrind at.
status_file="${TG_STATUS_FILE:-/tmp/taskgrind-status.json}"
phase=$(python3 - <<'PY' "$status_file"
import json, sys
path = sys.argv[1]
with open(path, "r", encoding="utf-8") as handle:
payload = json.load(handle)
print(payload.get("current_phase", "missing"))
print(payload.get("last_session", {}).get("result", "none"))
PY
)
current_phase=$(printf '%s\n' "$phase" | sed -n '1p')
last_result=$(printf '%s\n' "$phase" | sed -n '2p')
case "$current_phase" in
startup|preflight|running_session|running_sweep|session_complete|cooldown|git_sync|git_sync_skipped|queue_refilled|network_restored)
echo "healthy: let the grind keep running"
;;
queue_empty_wait|blocked_wait)
echo "idle: wait unless the repo should have work right now"
;;
waiting_for_network)
echo "degraded: alert only after the outage outlives TG_NET_MAX_WAIT"
;;
failed)
echo "page now: inspect the log and resume after fixing the cause"
;;
complete)
if [ "$last_result" = "completed" ]; then
echo "done: no restart needed unless new tasks arrived"
else
echo "finished with a non-success result: inspect before restarting"
fi
;;
*)
echo "unknown phase: inspect the status file and log before acting"
;;
esac
This pattern works well in launchd, systemd, or a lightweight cron watchdog:
page on failed, keep waiting through queue_empty_wait, and only auto-restart
after complete when new tasks or a fresh schedule justify another grind.
Status payload fields:
| Field | Type | Meaning |
|---|---|---|
| repo | string | Absolute or user-supplied repo path being ground |
| pid | number | Process ID of the current taskgrind run |
| slot | number | Claimed concurrency slot for this repo (0 owns git sync) |
| backend | string | Active backend such as devin, claude-code, or codex |
| skill | string | Skill prompt sent to each session |
| model | string | Resolved model name currently in use |
| session | number | Session counter for the current grind run |
| remaining_minutes | number | Whole minutes left until the current deadline, floored at 0 |
| current_phase | string | Current lifecycle phase such as startup, preflight, running_session, running_sweep, queue_refilled, session_complete, cooldown, git_sync, git_sync_skipped, queue_empty_wait, queue_empty, blocked_wait, all_tasks_blocked, waiting_for_network, network_restored, deadline_expired, audit_focus_blocked, complete, or failed |
| terminal_reason | string or null | Why a clean run stopped before current_phase rolled to complete; for example all_tasks_blocked, queue_empty, deadline_expired, or audit_focus_blocked |
| updated_at | string | Last write time in local ISO-like timestamp format (%Y-%m-%dT%H:%M:%S%z) |
| last_session.number | number | Most recently finished session number, or 0 before any session completes |
| last_session.result | string | Result label for the most recent session, such as completed, timeout, network_wait, or none before the first session |
| last_session.exit_code | number or null | Backend exit code for the most recent session, or null before the first completed session |
| last_session.shipped | number | Tasks shipped by the most recent session |
| last_session.duration_seconds | number | Runtime of the most recent session in seconds |
| last_session.completed_at | string | Completion timestamp for the most recent session, or empty string before any session completes |
Example lifecycle snapshots:
{
"repo": "/Users/alex/apps/myrepo",
"pid": 48122,
"slot": 0,
"backend": "devin",
"skill": "next-task",
"model": "claude-opus-4-7-max",
"session": 0,
"remaining_minutes": 479,
"current_phase": "preflight",
"terminal_reason": null,
"updated_at": "2026-04-11T18:05:12-0700",
"last_session": {
"number": 0,
"result": "none",
"exit_code": null,
"shipped": 0,
"duration_seconds": 0,
"completed_at": ""
}
}
{
"repo": "/Users/alex/apps/myrepo",
"pid": 48122,
"slot": 0,
"backend": "devin",
"skill": "next-task",
"model": "claude-opus-4-7-max",
"session": 3,
"remaining_minutes": 451,
"current_phase": "running_session",
"updated_at": "2026-04-11T18:33:44-0700",
"last_session": {
"number": 2,
"result": "completed",
"exit_code": 0,
"shipped": 1,
"duration_seconds": 742,
"completed_at": "2026-04-11T18:32:58-0700"
}
}
{
"repo": "/Users/alex/apps/myrepo",
"pid": 48122,
"slot": 0,
"backend": "devin",
"skill": "next-task",
"model": "claude-opus-4-7-max",
"session": 3,
"remaining_minutes": 449,
"current_phase": "waiting_for_network",
"updated_at": "2026-04-11T18:35:21-0700",
"last_session": {
"number": 3,
"result": "network_wait",
"exit_code": 1,
"shipped": 0,
"duration_seconds": 12,
"completed_at": "2026-04-11T18:35:19-0700"
}
}
{
"repo": "/Users/alex/apps/myrepo",
"pid": 48122,
"slot": 0,
"backend": "devin",
"skill": "next-task",
"model": "claude-opus-4-7-max",
"session": 7,
"remaining_minutes": 0,
"current_phase": "complete",
"updated_at": "2026-04-12T02:05:01-0700",
"last_session": {
"number": 7,
"result": "completed",
"exit_code": 0,
"shipped": 1,
"duration_seconds": 801,
"completed_at": "2026-04-12T02:04:55-0700"
}
}
In practice, current_phase moves from startup and preflight into active work (running_sweep or running_session), then through transitional phases such as queue_refilled, session_complete, cooldown, git_sync, git_sync_skipped, queue_empty_wait, or blocked_wait. Temporary interruptions show up as waiting_for_network and then network_restored. Sweep-only runs still record the sweep as the latest completed session before normal shutdown rewrites the file one last time as complete; argument or runtime failures finish as failed.
Watchdog mapping for the less obvious phases:
- startup / preflight: process is initializing, validating inputs, and claiming a slot
- running_session / running_sweep: active work is in progress
- queue_refilled: an empty-queue wait saw new work; let the process continue into the next session
- session_complete: a session just ended and taskgrind is about to decide between cooldown, waits, or shutdown
- cooldown: healthy pause between sessions
- git_sync: slot 0 is running the between-session fetch/rebase cycle
- git_sync_skipped: a higher slot intentionally skipped git sync; this is healthy for multi-instance runs
- queue_empty_wait / blocked_wait: intentionally idle; wait for queue changes instead of restarting
- queue_empty, all_tasks_blocked, deadline_expired, and audit_focus_blocked: stop reasons that also land in terminal_reason on the final complete snapshot so slower monitors do not miss why a clean grind stopped
- waiting_for_network: degraded but recoverable; taskgrind is extending the deadline while connectivity is down
- network_restored: connectivity recovered and the process is about to resume normal work
- complete / failed: terminal states for the current process
While taskgrind is running, create or edit .taskgrind-prompt in the target repo to add instructions to every subsequent session:
echo "focus on test coverage" > ~/apps/myrepo/.taskgrind-prompt
The file is re-read before each session. Its content is combined with --prompt if both are set. Delete the file to stop injecting.
Files larger than 10 KB are skipped as a safety guard to avoid accidentally
injecting generated output or other large blobs, and taskgrind logs a warning
like ⚠ .taskgrind-prompt too large (12345B > 10240B) — skipping so operators
can see why the override did not apply.
Switch models mid-grind without restarting — useful for switching from a powerful model to a faster one for simpler tasks:
echo "claude-sonnet-4.6" > ~/apps/myrepo/.taskgrind-model
The file is re-read before each session. Overrides --model and TG_MODEL when present. Short aliases such as opus, sonnet, haiku, codex, gpt, and swe resolve to the current preferred model IDs. Delete the file to revert to the original startup model. Files larger than 1 KB are skipped as a safety guard, and taskgrind logs a warning like ⚠ .taskgrind-model too large (2048B > 1024B) — skipping.
Both override files are only applied between sessions. The current in-flight session keeps its original prompt and model, and the next session picks up the updated file content.
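The size guard on both override files can be approximated in a few lines. This sketch is not taskgrind's code: `read_override` is a hypothetical helper, the warning text is simplified, and the byte limits passed by the caller (10240 for the prompt file, 1024 for the model file) come from the thresholds described above.

```shell
# Hypothetical helper: print an override file's content only when it is
# within the byte limit; otherwise warn on stderr and skip it.
read_override() {
  file=$1
  limit=$2
  [ -f "$file" ] || return 1                 # no override present
  size=$(wc -c < "$file" | tr -d ' ')        # byte count, whitespace trimmed
  if [ "$size" -gt "$limit" ]; then
    echo "warning: $file too large (${size}B > ${limit}B), skipping" >&2
    return 1
  fi
  cat "$file"
}
```

A caller would run something like `read_override "$repo/.taskgrind-model" 1024` once per session boundary and fall back to the startup model when it returns non-zero.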
By default, taskgrind allows two concurrent grinds on the same repo. Raise
TG_MAX_INSTANCES above 2 to allow more:
TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8
Each running grind claims the lowest free slot (0, 1, ...). Slot 0 remains the primary instance and owns the between-session git sync. Higher slots skip that sync and get extra prompt guidance to avoid overlapping file edits, which keeps one terminal responsible for fetch/rebase instead of letting multiple sessions fight over the same queue and branch state.
Operator example for a three-slot run:
# Terminal 1: primary instance
TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8
# Terminal 2: second worker
TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8
# Inspect current ownership before launching a third worker
TG_MAX_INSTANCES=3 taskgrind --preflight ~/apps/myrepo
Expected preflight header while two grinds are already active:
taskgrind --preflight
repo: /Users/you/apps/myrepo
backend: devin
skill: next-task
model: claude-opus-4-7-max
slots: 2/3 active
Conflict-avoidance expectations by slot:
- slot 0 is the only instance that performs the between-session git fetch/rebase sync cycle
- slot 1+ skips that sync, rebases just before committing, and should prefer TASKS.md updates, audits, docs, or other non-overlapping files when slot 0 is editing code
- If all slots are occupied, taskgrind prints which process owns each slot and tells you to raise TG_MAX_INSTANCES before starting another grind
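Claiming the lowest free slot can be sketched with mkdir, which either creates the directory or fails atomically on a local filesystem. Everything about the layout here is an assumption for illustration: `claim_slot` and the `slot-N` directory names are hypothetical, and taskgrind's real lock files and stale-lock handling are not described in this document.

```shell
# Hypothetical helper: claim the lowest free slot under a lock directory.
# Arguments: lock root directory, maximum number of instances.
claim_slot() {
  lock_root=$1
  max=$2
  slot=0
  while [ "$slot" -lt "$max" ]; do
    # mkdir succeeds for exactly one caller, so it doubles as the lock.
    if mkdir "$lock_root/slot-$slot" 2>/dev/null; then
      echo "$slot"
      return 0
    fi
    slot=$((slot + 1))
  done
  return 1   # every slot is taken
}
```

A process that claimed slot 0 would run the git sync, while any other slot number would skip it; releasing a slot in this sketch is a plain rmdir of the claimed directory.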
Supported two-stream workflow for one repo:
- Keep slot 0 on the normal next-task lane so it keeps shipping removable work from TASKS.md
- Put slot 1 on a discovery skill such as standing-audit-gap-loop, but back it with the reusable standing-loop pattern instead of a sacrificial repo-local audit task
- Define that discovery lane task in TASKS.md with durable metadata such as **ID**: discovery-standing-loop and **Tags**: standing-loop, audit, queue; taskgrind treats that as a valid queue-maintenance lane even though the task definition itself is meant to persist
- Let the discovery lane add normal tasks back into TASKS.md; slot 0 then picks them up and removes only the shipped work items, while the standing-loop definition remains available for the next discovery pass
- Add the standing-loop task before pointing taskgrind at an audit-only skill; taskgrind refuses audit-only sessions unless TASKS.md already contains a supported discovery-lane standing-loop task
Example standing-loop definition:
# Tasks
## P0
- [ ] Keep the discovery lane replenishing the queue
**ID**: discovery-standing-loop
**Tags**: standing-loop, audit, queue
**Details**: Continuously discover high-value follow-up work for slot 0 to ship.
**Files**: `TASKS.md`, `docs/user-stories.md`
**Acceptance**: The discovery lane keeps adding normal removable tasks while this standing-loop definition remains available for the next pass.
If taskgrind is interrupted unexpectedly, rerun it with --resume in the same repo:
taskgrind --resume ~/apps/myrepo
Plain taskgrind --resume ~/apps/myrepo is enough only when the interrupted
run used the same startup defaults you are using now. If the interrupted run
started with explicit --backend, --model, --skill, or baseline
--prompt / TG_PROMPT overrides, repeat those same choices on the resume
command.
Taskgrind saves resumable runtime state in ~/apps/myrepo/.taskgrind-state while the grind is active. A resumed run restores the original deadline, session counter, shipped count, backend, skill, model, and baseline focus prompt instead of starting from session 1 again.
The saved state file is a flat key=value snapshot, not JSON. Today it stores
the schema version, absolute repo, resumability status, deadline,
session, tasks_shipped, sessions_zero_ship, consecutive_zero_ship,
backend, skill, model, startup_model, and startup_prompt. The saved
focus prompt is the baseline --prompt or TG_PROMPT text from startup;
repo-local .taskgrind-prompt edits still stay live-only and are re-read on
resume. See docs/resume-state.md for the current contract and validation
rules.
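Because the state file is flat key=value text, a supervisor can read a field without a JSON parser. The helper below is a sketch, assuming one `key=value` pair per line with unquoted values; `state_get` is a hypothetical name, and docs/resume-state.md remains the reference for the actual format.

```shell
# Hypothetical helper: print the value of one key from a flat key=value file.
# Arguments: state file path, key name.
state_get() {
  file=$1
  key=$2
  while IFS= read -r line; do
    case "$line" in
      "$key="*)
        printf '%s\n' "${line#*=}"   # drop everything through the first "="
        return 0
        ;;
    esac
  done < "$file"
  return 1                           # key not present
}
```

For example, `state_get ~/apps/myrepo/.taskgrind-state backend` would show which backend to repeat on the resume command, provided the file follows the assumed layout.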
Use --resume when the previous run was interrupted by a terminal crash,
reboot, or similar external interruption. Prefer a fresh taskgrind launch
when you intentionally want a new deadline or different runtime settings. If
the saved deadline already expired, taskgrind rejects the stale state and tells
you to start fresh. Resume also requires the original --backend, --model,
--skill, and baseline --prompt / TG_PROMPT inputs to match. If you try to
resume with different overrides, taskgrind rejects that mismatch explicitly so
a resumed grind does not silently change direction.
Use this playbook when an unattended grind looks stuck, blocked, or noisy. Start
with the status file when TG_STATUS_FILE is enabled, then confirm the same
story in the log named in the startup banner.
| Symptom | Inspect | Recovery |
|---|---|---|
| Queue looks stuck even though the process is alive | current_phase in TG_STATUS_FILE; log lines containing queue_empty_wait, blocked_wait, or running_sweep | If the phase is queue_empty_wait or blocked_wait, leave the grind running while another agent or operator refills or unblocks TASKS.md. If the repo should already have work, open TASKS.md and fix claimed/blocking entries instead of restarting immediately. |
| Another terminal says the repo is busy or a new worker will not start | taskgrind --preflight ~/apps/myrepo for slots: N/M active; the active-slot owner list in preflight output; current_phase in TG_STATUS_FILE for the active worker | Wait for a slot to free up, or raise TG_MAX_INSTANCES before starting another grind. Keep slot 0 as the sync owner; point higher slots at docs, audits, TASKS.md maintenance, or status-file supervision instead of overlapping code edits. |
| Sessions keep ending with zero shipped tasks | last_session.result, last_session.shipped, and log markers such as productive_zero_ship, shipped_inferred, or repeated tasks_after= counts | Read the last few session summaries before killing the run. If the queue is churning under another agent, taskgrind may still be shipping work. If the same task is being retried without progress, tighten the prompt, split the task, or remove the blocker in TASKS.md before resuming. |
| Network outages pause progress for too long | current_phase=waiting_for_network; log lines around connectivity retries and network_restored | Let taskgrind hold the deadline open during short outages. If the outage exceeds TG_NET_MAX_WAIT, restore connectivity first, then resume with the same repo plus the original startup overrides so the saved backend/model/skill/prompt contract still matches. |
| --resume refuses to continue | The rejection message in stderr; .taskgrind-state; docs/resume-state.md for the saved field contract | Fix the mismatch the message calls out: rerun with the same repo plus the same --backend, --model, --skill, and baseline --prompt / TG_PROMPT inputs, restore the missing state file, or start a fresh grind if the deadline already expired. Do not copy stale state across repos. |
| Final push or sync fails during shutdown | The final git push / git pull --rebase lines in the log; git status --short; git log --oneline --decorate -5 | Resolve the git problem in the repo first, usually with git pull --rebase for incoming changes or by fixing the rejected push target. Then rerun resume with the same repo plus the original startup overrides if the interrupted run did not use pure defaults. |
Safe recovery loop:
- Read TG_STATUS_FILE to learn whether the grind is working, waiting, or failed.
- Tail the matching log file to confirm the latest session result and git state.
- If slot 0 is already active, keep later slots on supervision or other non-overlapping work until the sync lane is free.
- Run taskgrind --preflight ~/apps/myrepo before adding more workers or after clearing a blocker.
- Prefer taskgrind --resume ~/apps/myrepo after crashes, reboots, or push failures when the original run used the same startup defaults you still want. Otherwise repeat the original --backend, --model, --skill, and baseline --prompt / TG_PROMPT choices on the resume command so validation succeeds.
- If resume is rejected, retry with the original startup overrides or start a fresh run on purpose.
make install # symlink to /usr/local/bin + install man page
make audit # run the local repo audit workflow
make lint # shellcheck
make test # bats test suite (cached, auto-capped parallelism)
make test-force # rerun the selected bats suite without cache
make test TESTS=tests/bash-compat.bats # targeted rerun with its own cache key
make test TEST_JOBS=4 # override the auto-capped parallelism for diagnostics
make check # lint + test
make uninstall # remove symlink and man page
Requires: bats-core, shellcheck
Taskgrind runtime files must stay compatible with /bin/bash 3.2, and
tests/verify-bash32-compat.sh is the guard that enforces that contract during
the bats suite.
# macOS
brew install bats-core shellcheck
# Ubuntu / Debian
sudo apt-get update
sudo apt-get install -y npm shellcheck
sudo npm install -g bats
# Fedora / RHEL
sudo dnf install -y bats ShellCheck
On Linux, the supported bats install path is the npm flow above so local
make check runs match the GitHub Actions CI environment.
Extracted from dotfiles where it lived as dvb-grind. The dvb-grind name still works as a shell alias in dotfiles for backward compatibility.
- User Stories — real usage patterns with commands and sample output
- Architecture — design decisions and rationale
- Resume State — saved-state fields, validation rules, and restore behavior
MIT