taskgrind

TL;DR

Taskgrind runs repeated AI coding sessions against any repo that keeps its queue in TASKS.md, stopping when the deadline, queue state, or stall guard says the run is done. Use taskgrind --preflight to verify the backend and repo before a long run, then steer later sessions with repo-local prompt or model overrides instead of restarting the whole grind.

Sessions should exit before context fills; context exhaustion can crash the process and lose uncommitted work.

Autonomous multi-session grind — runs sequential AI coding sessions until a deadline. Each session starts with full context. State lives in TASKS.md + git, so sessions pick up seamlessly. Sessions still need to exit before the model context fills up; a context-exhausted crash can drop any uncommitted work from that session.

Taskgrind ships built-in backends for Devin, Claude Code, and Codex, and it works with any repo that uses the tasks.md spec for task management.

For local tests and repo audit helpers, keep DVB_GRIND_CMD to a single executable path. If you need a compound shell command, wrap it in a helper script first so preflight and session launch can validate it correctly.

Prerequisites

Requires macOS or Linux (or WSL on Windows).

You need at least one AI coding backend installed:

| Backend | Install |
| --- | --- |
| Devin CLI | curl -fsSL https://cli.devin.ai/install.sh \| sh |
| Claude Code | npm install -g @anthropic-ai/claude-code |
| Codex | npm install -g @openai/codex |

Taskgrind defaults to Devin. Use --backend claude-code or --backend codex to switch.

Backend setup matrix

Use taskgrind --preflight ~/apps/myrepo after installing a backend. The same checks run before a real grind starts, so this is the fastest way to confirm the binary, model, and network assumptions for the backend you chose.

| Backend | Binary taskgrind looks for | Model validation before session 1 | Most actionable setup failures |
| --- | --- | --- | --- |
| devin | devin from PATH, or TG_DEVIN_PATH if you override it | Validates the requested model by running devin --model "$TG_MODEL" --help during preflight | Backend binary not found (devin) means the CLI is missing or TG_DEVIN_PATH points at the wrong file. Model rejected by devin before starting means the model string is wrong for your Devin install. If the startup probe says the binary is a stub or broken after --version, reinstall or roll back the Devin CLI before retrying. |
| claude-code | claude from PATH | Validates the requested model by running claude --model "$TG_MODEL" --help during preflight | Backend binary not found (claude-code) usually means @anthropic-ai/claude-code is not installed globally or claude is not on PATH. Model rejected by claude-code before starting means the selected Claude model is unavailable to that install or account. |
| codex | codex from PATH | Validates the requested model by running codex --model "$TG_MODEL" --help during preflight | Backend binary not found (codex) means the Codex CLI is missing from PATH. If you keep the default Anthropic-flavored model while using --backend codex, taskgrind warns before launch because Codex expects an OpenAI model such as o3 or gpt-5.4. A later Model rejected by codex before starting failure means the chosen OpenAI model name is not accepted by your local Codex install. |

Practical examples:

taskgrind --preflight ~/apps/myrepo
taskgrind --preflight --backend claude-code --model claude-sonnet-4.6 ~/apps/myrepo
taskgrind --preflight --backend codex --model o3 ~/apps/myrepo
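
If you launch grinds from a script, it can help to gate the real run on the preflight result. The wrapper below is a minimal sketch, not part of taskgrind: it assumes `taskgrind --preflight` exits non-zero when a check fails (this README does not state the exit code), and `TASKGRIND_BIN` and `grind_after_preflight` are hypothetical names introduced for illustration.

```shell
# Hypothetical wrapper: run preflight first, start the grind only if it passes.
# Assumption: `taskgrind --preflight` exits non-zero when a check fails.
# TASKGRIND_BIN is an illustrative override so the wrapper can be exercised with a stub.
TASKGRIND_BIN="${TASKGRIND_BIN:-taskgrind}"

grind_after_preflight() {
  repo=$1
  hours=$2
  if ! "$TASKGRIND_BIN" --preflight "$repo"; then
    echo "preflight failed for $repo; not starting the grind" >&2
    return 1
  fi
  "$TASKGRIND_BIN" "$repo" "$hours"
}

# Usage: grind_after_preflight ~/apps/myrepo 8
```

Add any --backend or --model flags to both calls so the preflight validates the same configuration the grind will use.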

Install

Homebrew (macOS / Linux)

brew install cbrwizard/tap/taskgrind

Manual

# One-liner
curl -fsSL https://raw.githubusercontent.com/cbrwizard/taskgrind/main/install.sh | sh

# Or clone manually
git clone https://github.com/cbrwizard/taskgrind.git ~/apps/taskgrind

# Custom install directory
TASKGRIND_INSTALL_DIR=~/tools/taskgrind sh -c "$(curl -fsSL https://raw.githubusercontent.com/cbrwizard/taskgrind/main/install.sh)"

# Add to PATH (add to your shell rc)
export PATH="$HOME/apps/taskgrind/bin:$PATH"

To update: brew upgrade taskgrind (Homebrew) or cd ~/apps/taskgrind && git pull --rebase (manual)

Contributor audit shortcut: run make audit to reproduce the local repo-audit pass without any network-only dependencies. The pass runs an actionable scan for real task markers, shellcheck, and the review queue for the core docs and repo-local audit skills: README.md, CONTRIBUTING.md, SECURITY.md, AGENTS.md, Agentfile.yaml, docs/architecture.md, docs/resume-state.md, docs/user-stories.md, man/taskgrind.1, .devin/skills/standing-audit-gap-loop/SKILL.md, and .devin/skills/grind-log-analyze/SKILL.md.

Usage

taskgrind                              # 10h grind (default), current dir
taskgrind 10                           # 10h grind
taskgrind ~/apps/myrepo 10             # 10h grind in specific repo
taskgrind --model claude-opus-4-7-max 8 # use specific model
taskgrind --model "gpt-5.4 XHigh thinking fast" 8  # quote multi-word model names
taskgrind --skill pipeline-ops ~/apps/bosun 10  # custom installed skill
taskgrind --prompt "focus on test coverage" 8  # focus prompt
taskgrind --backend claude-code 8       # use Claude Code backend
taskgrind --dry-run 8 ~/apps/myrepo    # print config without running
taskgrind --preflight ~/apps/myrepo    # run health checks only
taskgrind --resume ~/apps/myrepo       # resume an interrupted grind
taskgrind --help                       # show usage and environment variables (short form: -h)
taskgrind --version                    # print version as commit hash + date (short form: -V)
TG_MODEL=sonnet taskgrind 8            # pick a model alias without changing shell history
TG_BACKEND=codex taskgrind 8           # make a wrapper or terminal default use Codex
TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8  # allow three concurrent grinds per repo
TG_STATUS_FILE=/tmp/taskgrind-status.json taskgrind ~/apps/myrepo 8  # write machine-readable status snapshots

Arguments can appear in any order. Hours is any bare integer 1-24.
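
Order-independent arguments imply that each bare argument is classified by its shape rather than its position. The function below is a rough sketch of that idea, not taskgrind's actual parser; `classify_arg` is a hypothetical name, and treating any existing directory as the repo is an assumption.

```shell
# Hypothetical argument classifier, not taskgrind's parser.
# Prints "hours" for a bare integer 1-24, "repo" for an existing directory,
# and "other" for anything else (flags, prompts, and so on).
classify_arg() {
  case "$1" in
    ''|*[!0-9]*)
      # Not a bare integer: treat an existing directory as the repo path.
      if [ -d "$1" ]; then echo "repo"; else echo "other"; fi
      ;;
    *)
      if [ "$1" -ge 1 ] && [ "$1" -le 24 ]; then echo "hours"; else echo "other"; fi
      ;;
  esac
}

for arg in 8 . 25; do
  printf '%s -> %s\n' "$arg" "$(classify_arg "$arg")"
done
# 8 -> hours
# . -> repo
# 25 -> other
```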

Env vars are especially useful when you wrap taskgrind in launchd, cron, shell aliases, or a small supervisor script. Use flags when you want a one-off override in your shell history; use TG_BACKEND or TG_MODEL when you want restarts and helper scripts to inherit the same defaults without retyping them on every launch.

--skill accepts any installed skill. Use repo-local skills such as standing-audit-gap-loop or globally installed skills such as pipeline-ops when you want a lane other than the default next-task workflow.

How It Works

  1. Launches an AI session with the next-task skill (configurable via --skill, backend via --backend)
  2. Session picks a task from TASKS.md, implements it, commits, and exits
  3. Between sessions: cooldown, optional git sync (every N sessions)
  4. Exits when: queue empty, all remaining tasks blocked, deadline reached, or stall detected
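
The steps above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not taskgrind's real implementation: `run_session` is a hypothetical stand-in the caller must define, shipped work is approximated as a drop in the open-task count, and cooldown, git sync, and blocked-queue detection are left out.

```shell
# Hypothetical sketch of the session loop; not taskgrind's real code.
# run_session must be defined by the caller (a real run would launch the backend).

count_tasks() {
  # Count unchecked "- [ ]" items. grep -c prints 0 but exits 1 on no match.
  grep -c '^- \[ ] ' "$1" || true
}

grind_loop() {
  tasks_file=$1      # path to TASKS.md
  deadline=$2        # epoch seconds
  max_zero_ship=$3   # consecutive zero-ship sessions before bailing
  zero_ship=0

  while :; do
    before=$(count_tasks "$tasks_file")
    if [ "$before" -eq 0 ]; then
      echo "queue_empty"
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "deadline_expired"
      return 0
    fi

    run_session "$tasks_file"
    after=$(count_tasks "$tasks_file")

    # A session "ships" here when it leaves fewer open tasks than it found.
    if [ "$after" -lt "$before" ]; then
      zero_ship=0
    else
      zero_ship=$((zero_ship + 1))
    fi
    if [ "$zero_ship" -ge "$max_zero_ship" ]; then
      echo "stalled"
      return 1
    fi
  done
}

# Usage: grind_loop TASKS.md "$(( $(date +%s) + 8 * 3600 ))" 50
```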

That session boundary is also the context-budget guard: keep prompts, plans, and scope small enough that each agent run can finish and commit before its context window fills. If a session crashes from context exhaustion, taskgrind can resume from git and TASKS.md, but any uncommitted edits from the crashed run are gone.

If a run stops because of a reboot, terminal loss, or a recoverable backend failure, taskgrind --resume <repo> restores the saved deadline, counters, backend, skill, startup prompt baseline, and startup model baseline for that same grind. It deliberately does not resurrect uncommitted edits from the interrupted session, so treat resume as "continue from the last clean commit," not "recover everything that was in memory." See docs/resume-state.md for the exact validation contract and docs/user-stories.md for an operator-facing resume example.

Task format

Taskgrind reads TASKS.md following the tasks.md spec. Tasks use checkbox format under priority headings:

# Tasks

## P0
- [ ] Fix critical bug in auth flow
  **ID**: fix-auth-bug
  **Tags**: bug, auth
  **Details**: The OAuth callback fails when...
  **Files**: `bin/taskgrind`, `tests/preflight.bats`
  **Acceptance**: Users can complete the OAuth callback without a retry loop.

## P1
- [ ] Add retry logic to API calls
  **ID**: add-api-retry
  **Tags**: reliability, api
  **Details**: Retries should cover transient 502/503 responses only.
  **Files**: `bin/taskgrind`, `tests/network.bats`
  **Acceptance**: Transient upstream failures retry with backoff and permanent failures still exit fast.
  **Blocked by**: backend-rate-limit-policy

Use **Blocked by** only when another task or external dependency truly prevents progress. Completed tasks are removed (not checked off). History lives in git log. See the tasks.md spec for the full format.
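
To sanity-check a queue before a long run, you can count open and blocked tasks yourself. The helpers below are a rough sketch that assumes the layout shown above (a `- [ ]` line followed by indented metadata); they are not taskgrind's parser, and the function names are hypothetical.

```shell
# Hypothetical helpers, not taskgrind's parser: count open and blocked tasks.
count_open_tasks() {
  # grep -c prints 0 but exits 1 when nothing matches, hence "|| true".
  grep -c '^- \[ ] ' "$1" || true
}

count_blocked_tasks() {
  # A task counts as blocked when a "**Blocked by**:" line appears in its
  # indented metadata, before the next task or heading starts.
  awk '
    /^- \[ ] / { in_task = 1; counted = 0; next }
    /^#/       { in_task = 0 }
    in_task && !counted && /^[ \t]+\*\*Blocked by\*\*:/ { blocked++; counted = 1 }
    END        { print blocked + 0 }
  ' "$1"
}

queue_state() {
  open=$(count_open_tasks "$1")
  blocked=$(count_blocked_tasks "$1")
  if [ "$open" -eq 0 ]; then
    echo "queue_empty"
  elif [ "$open" -eq "$blocked" ]; then
    echo "all_tasks_blocked"
  else
    echo "runnable"
  fi
}

# Usage: queue_state ~/apps/myrepo/TASKS.md
```

For the sample queue above, this reports two open tasks, one of them blocked, so the queue is still runnable.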

Features

  • Multi-backend support — works with Devin, Claude Code, and Codex via --backend
  • Model selection: --model claude-opus-4-7-max or TG_MODEL=claude-opus-4-7-max selects any model the backend supports; quote multi-word model names such as --model "gpt-5.4 XHigh thinking fast"; short aliases like opus and sonnet resolve to the current preferred model IDs
  • Live model switching — create/edit .taskgrind-model in the repo while running; changes take effect at the next session, including short alias resolution. Delete the file to revert to the startup model. Files larger than 1 KB are ignored with a warning.
  • Live prompt injection — create/edit .taskgrind-prompt in the repo while running; changes take effect at the next session. Files larger than 10 KB are ignored with a warning.
  • Preflight checks — validates the backend, network, repo, disk, queue, and optional watchdog setup before launch, plus active slot reporting. network-watchdog is optional; if missing, taskgrind falls back to curl for connectivity checks.
  • Self-copy protection — copies itself to $TMPDIR before running, survives script edits mid-grind
  • Slot-based per-repo locking: TG_MAX_INSTANCES allows multiple concurrent grinds on the same repo; slot 0 owns between-session git sync, higher slots get conflict-avoidance prompt guidance
  • Blocked-queue detection — exits early when all remaining tasks have **Blocked by**: metadata
  • Caffeinate integration — prevents system sleep on macOS (caffeinate) and Linux (systemd-inhibit)
  • Git sync with stash/rebase — between-session sync stashes dirty work, auto-detects the repo default branch from origin/HEAD, remote HEAD probes, upstream tracking, or local branch fallbacks, then rebases there and cleans merged branches; tests can force the branch with DVB_DEFAULT_BRANCH. If stash creation fails, taskgrind logs the original git error and skips stash pop; if stash pop fails after a successful stash, it leaves the stash intact for manual recovery. When a rebase conflict only touches TASKS.md, taskgrind now auto-resolves it by keeping the local queue edit so queue churn does not leave the repo stuck mid-rebase.
  • Empty-queue sweep — when TASKS.md is empty, launches a sweep session to find work, then waits for external task injection before exiting
  • Network resilience — pauses on network loss, extends deadline on recovery
  • Stall detection — bails after consecutive zero-ship sessions (configurable via TG_MAX_ZERO_SHIP)
  • Per-task retry cap — skips tasks attempted 3+ times without shipping
  • Fast-failure backoff — linear backoff with cap when sessions crash quickly
  • Ship-rate tracking — logs cumulative effectiveness in grind_done summary, including inferred shipped work when a session removes a completed task but concurrent queue churn keeps the raw task count flat
  • Productive timeout warning — detects when timeout kills sessions that were shipping
  • Unique log names — includes repo basename + PID to prevent collisions
  • External injection detection — logs when other processes add tasks mid-run
  • Graceful shutdown — SIGINT/SIGTERM waits for running session, pushes commits, ignores duplicate shutdown signals, then exits
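
The fast-failure backoff in the list above is described as linear with a cap. One plausible reading, shown here as an assumption rather than the exact formula taskgrind uses, is delay = min(base × consecutive fast failures, cap), with the base and cap taken from TG_BACKOFF_BASE and TG_BACKOFF_MAX. `backoff_seconds` is a hypothetical helper name.

```shell
# Hypothetical formula: delay = min(base * consecutive_fast_failures, cap).
backoff_seconds() {
  failures=$1
  base=${TG_BACKOFF_BASE:-15}
  cap=${TG_BACKOFF_MAX:-120}
  delay=$((failures * base))
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
}

# Usage: sleep "$(backoff_seconds "$consecutive_fast_failures")"
```

Under this reading and the documented defaults (15 and 120), the delay grows by 15 seconds per consecutive fast failure and stays at 120 seconds from the eighth failure onward.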

Security

Taskgrind runs AI backends with unrestricted permissions (--permission-mode dangerous for Devin, --dangerously-skip-permissions for Claude Code). This is required because sessions need full filesystem and network access to implement tasks autonomously.

Before deploying, ensure:

  • You trust the AI backend and the tasks in TASKS.md
  • The repo does not contain sensitive credentials that the AI should not access
  • You review the TASKS.md queue before starting a long grind

Environment Variables

TG_ is the canonical prefix. DVB_ is supported as a backward-compatible alias for all variables.

| Variable | Default | Description |
| --- | --- | --- |
| TG_BACKEND | devin | AI backend: devin, claude-code, codex |
| TG_MODEL | claude-opus-4-7-max | AI model (set to an OpenAI model when using --backend codex) |
| TG_SKILL | next-task | Skill to run each session |
| TG_PROMPT | (none) | Focus prompt for every session |
| TG_COOL | 5 | Seconds between sessions |
| TG_MAX_SESSION | 3600 | Max seconds per session |
| TG_MIN_SESSION | 30 | Fast-failure threshold in seconds |
| TG_MAX_FAST | 20 | Max consecutive fast failures before bail |
| TG_MAX_ZERO_SHIP | 50 | Consecutive zero-ship sessions before bail |
| TG_BACKOFF_BASE | 15 | Base seconds for fast-failure backoff |
| TG_BACKOFF_MAX | 120 | Cap for fast-failure backoff in seconds |
| TG_NET_WAIT | 30 | Network polling interval in seconds |
| TG_NET_MAX_WAIT | 14400 | Max time to wait for network recovery (4h) |
| TG_NET_RETRIES | 3 | Network check retry attempts before declaring down |
| TG_NET_RETRY_DELAY | 2 | Seconds between network check retries |
| TG_NET_CHECK_URL | https://connectivitycheck.gstatic.com/generate_204 | Override the fallback curl connectivity URL when network-watchdog is unavailable |
| TG_GIT_SYNC_TIMEOUT | 30 | Max seconds for between-session git sync |
| TG_SYNC_INTERVAL | 5 | Git sync every N sessions (0=every) |
| TG_EMPTY_QUEUE_WAIT | 600 | Seconds to wait after an empty sweep before giving up |
| TG_EARLY_EXIT_ON_STALL | 0 | Exit on low throughput (1=enabled) |
| TG_MAX_INSTANCES | 2 | Max concurrent instances per repo |
| TG_DEVIN_PATH | auto | Override devin binary path |
| TG_LOG | auto | Override log file path |
| TG_STATUS_FILE | (disabled) | Write machine-readable runtime status JSON to this path |
| TG_NOTIFY | 1 | Desktop notification on completion |
| TG_SHUTDOWN_GRACE | 120 | Seconds to wait for current session on exit |
| TG_SESSION_GRACE | 15 | Seconds to wait after session SIGINT before SIGTERM |
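
To see how the network settings fit together, here is a minimal sketch of a retrying connectivity check built on the curl fallback. It is an illustration, not taskgrind's own code: `network_is_up`, `default_probe`, and the `NET_PROBE` hook are hypothetical names, and the 5-second curl timeout is an assumption.

```shell
# Hypothetical retry loop around the curl fallback; not taskgrind's own code.
# NET_PROBE is an illustrative hook so the probe can be swapped out in tests.
NET_PROBE="${NET_PROBE:-default_probe}"

default_probe() {
  # -f makes curl fail on HTTP errors; the default URL answers with 204.
  curl -fsS -o /dev/null --max-time 5 \
    "${TG_NET_CHECK_URL:-https://connectivitycheck.gstatic.com/generate_204}"
}

network_is_up() {
  retries=${TG_NET_RETRIES:-3}
  delay=${TG_NET_RETRY_DELAY:-2}
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$NET_PROBE"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage: network_is_up || echo "network looks down"
```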

Monitoring

# Use the log path shown in the startup banner, or:
tail -f "${TMPDIR:-/tmp}"/taskgrind-*.log   # watch live progress
cat "${TMPDIR:-/tmp}"/taskgrind-*.log       # review completed sessions

Each session logs: start time, remaining minutes, task count, exit code, duration, and shipped count. When a session removes a completed task but concurrent additions, rollover, or non-local queue churn hide that work from the raw before/after task count, taskgrind logs both productive_zero_ship and shipped_inferred so operators can see why the session still counted as shipped. The grind_done summary includes ship rate, remaining tasks, and average session duration.

For machine-readable monitoring, set TG_STATUS_FILE to a JSON file path:

TG_STATUS_FILE=/tmp/taskgrind-status.json taskgrind ~/apps/myrepo 8
cat /tmp/taskgrind-status.json

The status file updates atomically on startup, before and after each session, during empty-queue sweeps and wait windows, during network waits, around git-sync decisions, and on final completion or failure. It includes the repo, process ID, slot, backend, skill, model, current session, remaining minutes, current phase, and the most recent session result.

Supervisor example:

#!/bin/sh
# Read the same path the grind was started with (TG_STATUS_FILE).
status_file="${TG_STATUS_FILE:-/tmp/taskgrind-status.json}"

phase=$(python3 - <<'PY' "$status_file"
import json, sys
path = sys.argv[1]
with open(path, "r", encoding="utf-8") as handle:
    payload = json.load(handle)
print(payload.get("current_phase", "missing"))
print(payload.get("last_session", {}).get("result", "none"))
PY
)

current_phase=$(printf '%s\n' "$phase" | sed -n '1p')
last_result=$(printf '%s\n' "$phase" | sed -n '2p')

case "$current_phase" in
  startup|preflight|running_session|running_sweep|session_complete|cooldown|git_sync|git_sync_skipped|queue_refilled|network_restored)
    echo "healthy: let the grind keep running"
    ;;
  queue_empty_wait|blocked_wait)
    echo "idle: wait unless the repo should have work right now"
    ;;
  waiting_for_network)
    echo "degraded: alert only after the outage outlives TG_NET_MAX_WAIT"
    ;;
  failed)
    echo "page now: inspect the log and resume after fixing the cause"
    ;;
  complete)
    if [ "$last_result" = "completed" ]; then
      echo "done: no restart needed unless new tasks arrived"
    else
      echo "finished with a non-success result: inspect before restarting"
    fi
    ;;
  queue_empty|all_tasks_blocked|deadline_expired|audit_focus_blocked)
    echo "stopping: clean stop reason recorded; expect a final complete snapshot"
    ;;
  *)
    echo "unknown phase: inspect the status file and log before acting"
    ;;
esac

This pattern works well in launchd, systemd, or a lightweight cron watchdog: page on failed, keep waiting through queue_empty_wait, and only auto-restart after complete when new tasks or a fresh schedule justify another grind.
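
A watchdog may also want to notice a status file that has stopped updating altogether. Because the file is rewritten before and after each session, a healthy grind can leave it untouched for up to TG_MAX_SESSION seconds (3600 by default), so any staleness threshold should sit above that. The check below is a sketch with a hypothetical helper name; it relies on `find -mmin`, which GNU and BSD find support but POSIX does not require.

```shell
# Hypothetical staleness check for a watchdog; the threshold is your choice.
# Exit codes: 0 = stale, 1 = fresh, 2 = status file missing.
status_is_stale() {
  file=$1
  max_minutes=$2
  [ -e "$file" ] || return 2
  # find prints the path only when it was modified more than max_minutes ago.
  [ -n "$(find "$file" -mmin +"$max_minutes")" ]
}

# Usage (75 minutes leaves headroom over the 60-minute default session cap):
# status_is_stale /tmp/taskgrind-status.json 75 && echo "status file is stale"
```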

Status payload fields:

| Field | Type | Meaning |
| --- | --- | --- |
| repo | string | Absolute or user-supplied repo path being ground |
| pid | number | Process ID of the current taskgrind run |
| slot | number | Claimed concurrency slot for this repo (0 owns git sync) |
| backend | string | Active backend such as devin, claude-code, or codex |
| skill | string | Skill prompt sent to each session |
| model | string | Resolved model name currently in use |
| session | number | Session counter for the current grind run |
| remaining_minutes | number | Whole minutes left until the current deadline, floored at 0 |
| current_phase | string | Current lifecycle phase such as startup, preflight, running_session, running_sweep, queue_refilled, session_complete, cooldown, git_sync, git_sync_skipped, queue_empty_wait, queue_empty, blocked_wait, all_tasks_blocked, waiting_for_network, network_restored, deadline_expired, audit_focus_blocked, complete, or failed |
| terminal_reason | string or null | Why a clean run stopped before current_phase rolled to complete; for example all_tasks_blocked, queue_empty, deadline_expired, or audit_focus_blocked |
| updated_at | string | Last write time in local ISO-like timestamp format (%Y-%m-%dT%H:%M:%S%z) |
| last_session.number | number | Most recently finished session number, or 0 before any session completes |
| last_session.result | string | Result label for the most recent session, such as completed, timeout, network_wait, or none before the first session |
| last_session.exit_code | number or null | Backend exit code for the most recent session, or null before the first completed session |
| last_session.shipped | number | Tasks shipped by the most recent session |
| last_session.duration_seconds | number | Runtime of the most recent session in seconds |
| last_session.completed_at | string | Completion timestamp for the most recent session, or empty string before any session completes |

Example lifecycle snapshots:

{
  "repo": "/Users/alex/apps/myrepo",
  "pid": 48122,
  "slot": 0,
  "backend": "devin",
  "skill": "next-task",
  "model": "claude-opus-4-7-max",
  "session": 0,
  "remaining_minutes": 479,
  "current_phase": "preflight",
  "terminal_reason": null,
  "updated_at": "2026-04-11T18:05:12-0700",
  "last_session": {
    "number": 0,
    "result": "none",
    "exit_code": null,
    "shipped": 0,
    "duration_seconds": 0,
    "completed_at": ""
  }
}

{
  "repo": "/Users/alex/apps/myrepo",
  "pid": 48122,
  "slot": 0,
  "backend": "devin",
  "skill": "next-task",
  "model": "claude-opus-4-7-max",
  "session": 3,
  "remaining_minutes": 451,
  "current_phase": "running_session",
  "updated_at": "2026-04-11T18:33:44-0700",
  "last_session": {
    "number": 2,
    "result": "completed",
    "exit_code": 0,
    "shipped": 1,
    "duration_seconds": 742,
    "completed_at": "2026-04-11T18:32:58-0700"
  }
}

{
  "repo": "/Users/alex/apps/myrepo",
  "pid": 48122,
  "slot": 0,
  "backend": "devin",
  "skill": "next-task",
  "model": "claude-opus-4-7-max",
  "session": 3,
  "remaining_minutes": 449,
  "current_phase": "waiting_for_network",
  "updated_at": "2026-04-11T18:35:21-0700",
  "last_session": {
    "number": 3,
    "result": "network_wait",
    "exit_code": 1,
    "shipped": 0,
    "duration_seconds": 12,
    "completed_at": "2026-04-11T18:35:19-0700"
  }
}

{
  "repo": "/Users/alex/apps/myrepo",
  "pid": 48122,
  "slot": 0,
  "backend": "devin",
  "skill": "next-task",
  "model": "claude-opus-4-7-max",
  "session": 7,
  "remaining_minutes": 0,
  "current_phase": "complete",
  "updated_at": "2026-04-12T02:05:01-0700",
  "last_session": {
    "number": 7,
    "result": "completed",
    "exit_code": 0,
    "shipped": 1,
    "duration_seconds": 801,
    "completed_at": "2026-04-12T02:04:55-0700"
  }
}

In practice, current_phase moves from startup and preflight into active work (running_sweep or running_session), then through transitional phases such as queue_refilled, session_complete, cooldown, git_sync, git_sync_skipped, queue_empty_wait, or blocked_wait. Temporary interruptions show up as waiting_for_network and then network_restored. Sweep-only runs still record the sweep as the latest completed session before normal shutdown rewrites the file one last time as complete; argument or runtime failures finish as failed.

Watchdog mapping for the less obvious phases:

  • startup / preflight: process is initializing, validating inputs, and claiming a slot
  • running_session / running_sweep: active work is in progress
  • queue_refilled: an empty-queue wait saw new work; let the process continue into the next session
  • session_complete: a session just ended and taskgrind is about to decide between cooldown, waits, or shutdown
  • cooldown: healthy pause between sessions
  • git_sync: slot 0 is running the between-session fetch/rebase cycle
  • git_sync_skipped: a higher slot intentionally skipped git sync; this is healthy for multi-instance runs
  • queue_empty_wait / blocked_wait: intentionally idle; wait for queue changes instead of restarting
  • queue_empty, all_tasks_blocked, deadline_expired, and audit_focus_blocked: stop reasons that also land in terminal_reason on the final complete snapshot so slower monitors do not miss why a clean grind stopped
  • waiting_for_network: degraded but recoverable; taskgrind is extending the deadline while connectivity is down
  • network_restored: connectivity recovered and the process is about to resume normal work
  • complete / failed: terminal states for the current process

Live prompt injection

While taskgrind is running, create or edit .taskgrind-prompt in the target repo to add instructions to every subsequent session:

echo "focus on test coverage" > ~/apps/myrepo/.taskgrind-prompt

The file is re-read before each session. Combined with --prompt if both are set. Delete the file to stop injecting. Files larger than 10 KB are skipped as a safety guard to avoid accidentally injecting generated output or other large blobs, and taskgrind logs a warning like ⚠ .taskgrind-prompt too large (12345B > 10240B) — skipping so operators can see why the override did not apply.

Live model switching

Switch models mid-grind without restarting — useful for switching from a powerful model to a faster one for simpler tasks:

echo "claude-sonnet-4.6" > ~/apps/myrepo/.taskgrind-model

The file is re-read before each session. Overrides --model and TG_MODEL when present. Short aliases such as opus, sonnet, haiku, codex, gpt, and swe resolve to the current preferred model IDs. Delete the file to revert to the original startup model. Files larger than 1 KB are skipped as a safety guard, and taskgrind logs a warning like ⚠ .taskgrind-model too large (2048B > 1024B) — skipping.

Both override files are only applied between sessions. The current in-flight session keeps its original prompt and model, and the next session picks up the updated file content.
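
The size guard on both override files can be pictured as a small read-with-limit helper run before each session. The sketch below is illustrative only: `read_override` is a hypothetical name, the warning text is not taskgrind's exact log line, and the fallback variables in the usage comments are assumptions.

```shell
# Hypothetical guard, not taskgrind's code: print an override file's content
# unless the file is missing or larger than the byte limit.
read_override() {
  file=$1
  limit=$2
  [ -f "$file" ] || return 1
  # Arithmetic expansion strips the padding some wc implementations add.
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -gt "$limit" ]; then
    echo "warning: $file too large (${size}B > ${limit}B), skipping" >&2
    return 1
  fi
  cat "$file"
}

# Before each session, fall back to the startup value when no override applies:
# model=$(read_override "$repo/.taskgrind-model" 1024) || model=$startup_model
# extra_prompt=$(read_override "$repo/.taskgrind-prompt" 10240) || extra_prompt=""
```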

Concurrent instances on one repo

By default, taskgrind allows two concurrent grinds on the same repo. Raise TG_MAX_INSTANCES above 2 to allow more:

TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8

Each running grind claims the lowest free slot (0, 1, ...). Slot 0 remains the primary instance and owns the between-session git sync. Higher slots skip that sync and get extra prompt guidance to avoid overlapping file edits, which keeps one terminal responsible for fetch/rebase instead of letting multiple sessions fight over the same queue and branch state.
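
Claiming the lowest free slot can be illustrated with atomic directory creation, a common shell locking technique. This is a sketch of the idea, not a description of how taskgrind implements its locks: `claim_slot`, `release_slot`, and the lock directory layout are hypothetical, and the sketch does not handle stale locks left behind by a crashed process.

```shell
# Hypothetical slot claim using atomic mkdir; taskgrind's real locking may differ.
# Prints the claimed slot number, or returns 1 when every slot is taken.
claim_slot() {
  lock_root=$1
  max_instances=$2
  slot=0
  mkdir -p "$lock_root"
  while [ "$slot" -lt "$max_instances" ]; do
    # mkdir fails if the directory already exists, so only one process wins.
    if mkdir "$lock_root/slot-$slot" 2>/dev/null; then
      echo "$slot"
      return 0
    fi
    slot=$((slot + 1))
  done
  return 1
}

release_slot() {
  # Arguments: lock root, slot number.
  rmdir "$1/slot-$2"
}

# Usage: slot=$(claim_slot /tmp/myrepo-locks 3) || echo "all slots busy"
```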

Operator example for a three-slot run:

# Terminal 1: primary instance
TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8

# Terminal 2: second worker
TG_MAX_INSTANCES=3 taskgrind ~/apps/myrepo 8

# Inspect current ownership before launching a third worker
TG_MAX_INSTANCES=3 taskgrind --preflight ~/apps/myrepo

Expected preflight header while two grinds are already active:

taskgrind --preflight
  repo:     /Users/you/apps/myrepo
  backend:  devin
  skill:    next-task
  model:    claude-opus-4-7-max
  slots:    2/3 active

Conflict-avoidance expectations by slot:

  • slot 0 is the only instance that performs the between-session git fetch / rebase sync cycle
  • slot 1+ skips that sync, rebases just before committing, and should prefer TASKS.md updates, audits, docs, or other non-overlapping files when slot 0 is editing code
  • If all slots are occupied, taskgrind prints which process owns each slot and tells you to raise TG_MAX_INSTANCES before starting another grind

Supported two-stream workflow for one repo:

  • Keep slot 0 on the normal next-task lane so it keeps shipping removable work from TASKS.md
  • Put slot 1 on a discovery skill such as standing-audit-gap-loop, but back it with the reusable standing-loop pattern instead of a sacrificial repo-local audit task
  • Define that discovery lane task in TASKS.md with durable metadata such as **ID**: discovery-standing-loop and **Tags**: standing-loop, audit, queue; taskgrind treats that as a valid queue-maintenance lane even though the task definition itself is meant to persist
  • Let the discovery lane add normal tasks back into TASKS.md; slot 0 then picks them up and removes only the shipped work items, while the standing-loop definition remains available for the next discovery pass
  • If you point taskgrind at an audit-only skill and TASKS.md has no supported discovery-lane standing-loop task, taskgrind refuses to run the audit-only sessions

Example standing-loop definition:

# Tasks

## P0
- [ ] Keep the discovery lane replenishing the queue
  **ID**: discovery-standing-loop
  **Tags**: standing-loop, audit, queue
  **Details**: Continuously discover high-value follow-up work for slot 0 to ship.
  **Files**: `TASKS.md`, `docs/user-stories.md`
  **Acceptance**: The discovery lane keeps adding normal removable tasks while this standing-loop definition remains available for the next pass.

Resuming an interrupted grind

If taskgrind is interrupted unexpectedly, rerun it with --resume in the same repo:

taskgrind --resume ~/apps/myrepo

Plain taskgrind --resume ~/apps/myrepo is enough only when the interrupted run used the same startup defaults you are using now. If the interrupted run started with explicit --backend, --model, --skill, or baseline --prompt / TG_PROMPT overrides, repeat those same choices on the resume command.

Taskgrind saves resumable runtime state in ~/apps/myrepo/.taskgrind-state while the grind is active. A resumed run restores the original deadline, session counter, shipped count, backend, skill, model, and baseline focus prompt instead of starting from session 1 again.

The saved state file is a flat key=value snapshot, not JSON. Today it stores the schema version, absolute repo, resumability status, deadline, session, tasks_shipped, sessions_zero_ship, consecutive_zero_ship, backend, skill, model, startup_model, and startup_prompt. The saved focus prompt is the baseline --prompt or TG_PROMPT text from startup; repo-local .taskgrind-prompt edits still stay live-only and are re-read on resume. See docs/resume-state.md for the current contract and validation rules.

Use --resume when the previous run was interrupted by a terminal crash, reboot, or similar external interruption. Prefer a fresh taskgrind launch when you intentionally want a new deadline or different runtime settings. If the saved deadline already expired, taskgrind rejects the stale state and tells you to start fresh. Resume also requires the original --backend, --model, --skill, and baseline --prompt / TG_PROMPT inputs to match. If you try to resume with different overrides, taskgrind rejects that mismatch explicitly so a resumed grind does not silently change direction.
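
Because the state file is flat key=value text, you can inspect it with ordinary shell tools before deciding between --resume and a fresh run. The helpers below are a sketch with hypothetical names; they assume one key per line and that `deadline` is stored as epoch seconds, which this README does not confirm (see docs/resume-state.md for the real contract).

```shell
# Hypothetical reader for a flat key=value state file.
# Assumptions: one key per line, and `deadline` holds epoch seconds.
state_get() {
  # Print the value of key $2 from file $1 (first match; values may contain "=").
  sed -n "s/^$2=//p" "$1" | head -n 1
}

state_deadline_expired() {
  # Succeeds only when a deadline is present and already in the past.
  deadline=$(state_get "$1" deadline)
  [ -n "$deadline" ] && [ "$(date +%s)" -ge "$deadline" ]
}

# Usage:
# state_get ~/apps/myrepo/.taskgrind-state backend
# state_deadline_expired ~/apps/myrepo/.taskgrind-state && echo "start a fresh grind"
```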

Troubleshooting

Use this playbook when an unattended grind looks stuck, blocked, or noisy. Start with the status file when TG_STATUS_FILE is enabled, then confirm the same story in the log named in the startup banner.

| Symptom | Inspect | Recovery |
| --- | --- | --- |
| Queue looks stuck even though the process is alive | current_phase in TG_STATUS_FILE; log lines containing queue_empty_wait, blocked_wait, or running_sweep | If the phase is queue_empty_wait or blocked_wait, leave the grind running while another agent or operator refills or unblocks TASKS.md. If the repo should already have work, open TASKS.md and fix claimed/blocking entries instead of restarting immediately. |
| Another terminal says the repo is busy or a new worker will not start | taskgrind --preflight ~/apps/myrepo for slots: N/M active; the active-slot owner list in preflight output; current_phase in TG_STATUS_FILE for the active worker | Wait for a slot to free up, or raise TG_MAX_INSTANCES before starting another grind. Keep slot 0 as the sync owner; point higher slots at docs, audits, TASKS.md maintenance, or status-file supervision instead of overlapping code edits. |
| Sessions keep ending with zero shipped tasks | last_session.result, last_session.shipped, and log markers such as productive_zero_ship, shipped_inferred, or repeated tasks_after= counts | Read the last few session summaries before killing the run. If the queue is churning under another agent, taskgrind may still be shipping work. If the same task is being retried without progress, tighten the prompt, split the task, or remove the blocker in TASKS.md before resuming. |
| Network outages pause progress for too long | current_phase=waiting_for_network; log lines around connectivity retries and network_restored | Let taskgrind hold the deadline open during short outages. If the outage exceeds TG_NET_MAX_WAIT, restore connectivity first, then resume with the same repo plus the original startup overrides so the saved backend/model/skill/prompt contract still matches. |
| --resume refuses to continue | The rejection message in stderr; .taskgrind-state; docs/resume-state.md for the saved field contract | Fix the mismatch the message calls out: rerun with the same repo plus the same --backend, --model, --skill, and baseline --prompt / TG_PROMPT inputs, restore the missing state file, or start a fresh grind if the deadline already expired. Do not copy stale state across repos. |
| Final push or sync fails during shutdown | The final git push / git pull --rebase lines in the log; git status --short; git log --oneline --decorate -5 | Resolve the git problem in the repo first, usually with git pull --rebase for incoming changes or by fixing the rejected push target. Then rerun resume with the same repo plus the original startup overrides if the interrupted run did not use pure defaults. |

Safe recovery loop:

  1. Read TG_STATUS_FILE to learn whether the grind is working, waiting, or failed.
  2. Tail the matching log file to confirm the latest session result and git state.
  3. If slot 0 is already active, keep later slots on supervision or other non-overlapping work until the sync lane is free.
  4. Run taskgrind --preflight ~/apps/myrepo before adding more workers or after clearing a blocker.
  5. Prefer taskgrind --resume ~/apps/myrepo after crashes, reboots, or push failures when the original run used the same startup defaults you still want. Otherwise repeat the original --backend, --model, --skill, and baseline --prompt / TG_PROMPT choices on the resume command so validation succeeds.
  6. If resume is rejected, retry with the original startup overrides or start a fresh run on purpose.

Development

make install    # symlink to /usr/local/bin + install man page
make audit      # run the local repo audit workflow
make lint       # shellcheck
make test       # bats test suite (cached, auto-capped parallelism)
make test-force # rerun the selected bats suite without cache
make test TESTS=tests/bash-compat.bats  # targeted rerun with its own cache key
make test TEST_JOBS=4  # override the auto-capped parallelism for diagnostics
make check      # lint + test
make uninstall  # remove symlink and man page

Requires: bats-core, shellcheck

Taskgrind runtime files must stay compatible with /bin/bash 3.2, and tests/verify-bash32-compat.sh is the guard that enforces that contract during the bats suite.

# macOS
brew install bats-core shellcheck

# Ubuntu / Debian
sudo apt-get update
sudo apt-get install -y npm shellcheck
sudo npm install -g bats

# Fedora / RHEL
sudo dnf install -y bats ShellCheck

On Linux, the supported bats install path is the npm flow above so local make check runs match the GitHub Actions CI environment.

History

Extracted from dotfiles where it lived as dvb-grind. The dvb-grind name still works as a shell alias in dotfiles for backward compatibility.

Docs

  • User Stories — real usage patterns with commands and sample output
  • Architecture — design decisions and rationale
  • Resume State — saved-state fields, validation rules, and restore behavior

License

MIT
