Tandem v0.5.6

@github-actions github-actions released this 15 May 23:09
· 307 commits to main since this release
c217ad3

See the assets below to download the installer for your platform.

v0.5.6 (2026-05-14)

AI Evaluation Framework

This release ships the complete AI Evaluation Framework (Phases 1-5): a production-ready system for structured testing, regression detection, and compliance documentation of AI quality. The framework enables automated quality gates that keep AI performance regressions out of production, produce quantitative evidence of AI safety practices for compliance audits (e.g., EU AI Act Article 50 transparency), and let teams experiment with new AI features while maintaining a quality baseline.

What ships now:

  • Failure Mode Taxonomy (AIFailureMode enum with 30+ variants): Formalized categorization of AI failures across validation (artifact validation, contract violations, citation missing, web sources missing), provider (timeout, rate limit, model not found, stream failures), repair (budget exhausted, loop detection, unavailable), resource (token exhausted, cost exhausted, memory exhausted), authorization (failed, invalid API key, permission denied), and system-level (config error, dependency missing, feature disabled) domains. Each failure is classified into a severity category (Critical, High, Medium, Low) for incident response prioritization. Includes deterministic classifiers (classify_error_text(), categorize_failure(), should_retry()) for automatically tagging failures from real error text.

  • Evaluation Dataset Format (YAML/JSON): Structured EvalDataset schema with EvalTestCase definitions, AutomationSpecTest node specifications, and configurable EvalExpectedOutput validation criteria. Four example datasets ship in eval_datasets/: critical_path.yaml (happy-path scenarios for core features), provider_failures.yaml (resilience tests for timeouts and rate limits), repair_exhaustion.yaml (budget depletion edge cases), and citation_validation.yaml (web research validation). Dataset format is version-controlled and extensible for adding new test scenarios.

  • Eval Runner CLI (cargo run --bin eval-runner): Standalone command-line tool for bulk test execution with the following capabilities:

    • Metrics aggregation: Computes pass_rate, avg_repair_iterations, total_cost_usd, avg_cost_per_test, provider_failure_rate, and per-validator-class pass rates from test results
    • Simulation mode (--simulation): Deterministic test execution without making AI provider calls, making evals safe to run in CI and cost-free during development
    • Parallel workers (--num-workers): Configurable parallelism for faster test suite runs
    • Tag filtering (--filter-tag): Run only tests matching a specific tag (e.g., regression, happy_path)
    • CLI arguments: --dataset (required), --output, --provider, --model, --max-duration, --verbose
    • Output format: Human-readable summary on stdout + structured JSON to file for programmatic CI consumption
    • Exit codes: 0 (all tests passed), 1 (one or more test failures), 2 (dataset load error or invalid arguments)
  • Regression Detection System: Baseline comparison infrastructure that prevents AI quality degradation. EvalBaseline stores snapshots of evaluation metrics with git commit/branch metadata. The detect_regressions() function compares current run metrics against a saved baseline using configurable thresholds (defaults: 5 percentage point pass_rate drop, 20% cost increase, 30% repair iteration increase, 5 percentage point provider failure rate increase). Results are reported as RegressionStatus (Pass / Warning / Regression) with human-readable messages explaining the deltas.

  • Regression Detection CI Gate (.github/workflows/eval-regression-gate.yml): GitHub Actions workflow that:

    • Triggers on every PR against main/develop and on main branch pushes
    • Builds and runs eval-runner against the critical_path.yaml dataset
    • Compares results to the eval_baselines/main_branch.json baseline
    • Posts a summary comment on the PR with pass rate and test counts
    • Fails the check if any regression threshold is exceeded (protecting main from quality degradation)
    • Auto-updates the baseline on successful main branch push so future PR comparisons use the latest production metrics
    • Uploads eval results as a CI artifact for debugging and audit trail
  • Developer Documentation (docs/dev/EVAL_FRAMEWORK.md, ~500 lines): Comprehensive guide including:

    • Quick start with CLI examples
    • Architecture overview and data flow diagrams
    • Step-by-step guide for adding new eval datasets
    • Threshold customization patterns
    • Failure mode taxonomy reference
    • Running tests locally and in CI
    • Troubleshooting common issues (model differences, rate limits, nondeterminism)
    • FAQ covering simulation mode, parallelization, validators, costs, baseline updates
  • User & Compliance Documentation (docs/user/AI_QUALITY_ASSURANCE.md, ~350 lines): Non-technical guide covering:

    • Core quality metrics explained for end users (pass rate, repair iterations, cost, provider reliability)
    • Test scenarios described (happy path, edge cases, regression tests)
    • Quality assurance process (automated regression detection, continuous monitoring, failure analysis)
    • EU AI Act Article 50 compliance: Explicit mapping of Tandem's testing practices to transparency requirements for natural persons interacting with the AI system
    • Trust model: What Tandem can and cannot guarantee, limitations of AI systems
    • Incident response: How issues are detected, investigated, and resolved
    • Support contacts and glossary
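
To make the Failure Mode Taxonomy item above more concrete, here is a minimal Rust sketch of how deterministic classification from error text could look. The function names classify_error_text(), categorize_failure(), and should_retry() appear in this release, but their signatures, the FailureMode variants, the matched substrings, and the severity assignments below are all assumptions made for illustration, not Tandem's actual implementation.

```rust
/// Hypothetical subset of a failure taxonomy. The real AIFailureMode enum
/// has 30+ variants whose exact names are not listed in these notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureMode {
    ProviderTimeout,
    ProviderRateLimit,
    ModelNotFound,
    InvalidApiKey,
    RepairBudgetExhausted,
    Unknown,
}

/// Severity categories named in the release notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Low,
    Medium,
    High,
    Critical,
}

/// Tag a raw error string with a failure mode using fixed substring rules.
/// The same input always yields the same tag (no model calls involved).
/// The substrings are illustrative assumptions.
pub fn classify_error_text(text: &str) -> FailureMode {
    let t = text.to_lowercase();
    if t.contains("timed out") || t.contains("timeout") {
        FailureMode::ProviderTimeout
    } else if t.contains("rate limit") {
        FailureMode::ProviderRateLimit
    } else if t.contains("model") && t.contains("not found") {
        FailureMode::ModelNotFound
    } else if t.contains("invalid api key") {
        FailureMode::InvalidApiKey
    } else if t.contains("repair budget") {
        FailureMode::RepairBudgetExhausted
    } else {
        FailureMode::Unknown
    }
}

/// Map a failure mode to a severity (assumed mapping, for illustration only).
pub fn categorize_failure(mode: FailureMode) -> Severity {
    match mode {
        FailureMode::InvalidApiKey => Severity::Critical,
        FailureMode::ModelNotFound | FailureMode::RepairBudgetExhausted => Severity::High,
        FailureMode::ProviderTimeout | FailureMode::ProviderRateLimit => Severity::Medium,
        FailureMode::Unknown => Severity::Low,
    }
}

/// Only transient provider failures are treated as retryable in this sketch.
pub fn should_retry(mode: FailureMode) -> bool {
    matches!(
        mode,
        FailureMode::ProviderTimeout | FailureMode::ProviderRateLimit
    )
}
```

A caller would typically chain the three: classify the text first, then use the resulting mode both to pick a severity for incident response and to decide whether a retry is worthwhile.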

Quality Gate Features

The regression-detection workflow is designed to be low-friction and CI-friendly:

  • Default thresholds are conservative but tunable per team/metric
  • Simulation mode means evaluations complete in seconds with zero API cost
  • Pass/Warning/Regression statuses provide clear go/no-go signals for CI gates
  • JSON output integrates with custom CI/CD tooling for advanced use cases
  • PR comments make quality trends visible to the whole team
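
The threshold comparison behind these go/no-go signals can be sketched as follows. The metric names, the default thresholds (5 percentage points, 20%, 30%, 5 percentage points), and the Pass / Warning / Regression statuses come from this release; the struct layouts, the function signature, and especially the rule that a Warning fires at half of a threshold are assumptions made for this example, since the notes do not define when a Warning is raised.

```rust
/// Metrics for one eval run. Rates are fractions in 0.0..=1.0 (assumed layout).
#[derive(Debug, Clone, Copy)]
pub struct Metrics {
    pub pass_rate: f64,
    pub total_cost_usd: f64,
    pub avg_repair_iterations: f64,
    pub provider_failure_rate: f64,
}

/// Tunable limits; `Default` uses the values quoted in the release notes.
pub struct Thresholds {
    pub max_pass_rate_drop_pp: f64,
    pub max_cost_increase_pct: f64,
    pub max_repair_increase_pct: f64,
    pub max_provider_failure_increase_pp: f64,
}

impl Default for Thresholds {
    fn default() -> Self {
        Thresholds {
            max_pass_rate_drop_pp: 5.0,
            max_cost_increase_pct: 20.0,
            max_repair_increase_pct: 30.0,
            max_provider_failure_increase_pp: 5.0,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegressionStatus {
    Pass,
    Warning,
    Regression,
}

/// Relative increase in percent; a zero baseline with a nonzero current
/// value counts as an unbounded increase.
fn pct_increase(base: f64, cur: f64) -> f64 {
    if base > 0.0 {
        (cur - base) / base * 100.0
    } else if cur > 0.0 {
        f64::INFINITY
    } else {
        0.0
    }
}

/// Compare a current run against a baseline (hypothetical signature).
/// Assumption: exceeding a limit is a Regression, exceeding half of it
/// is a Warning. The worst status across all metrics wins.
pub fn detect_regressions(
    base: &Metrics,
    cur: &Metrics,
    t: &Thresholds,
) -> (RegressionStatus, Vec<String>) {
    let checks = [
        ("pass_rate drop (pp)", (base.pass_rate - cur.pass_rate) * 100.0, t.max_pass_rate_drop_pp),
        ("cost increase (%)", pct_increase(base.total_cost_usd, cur.total_cost_usd), t.max_cost_increase_pct),
        ("repair iteration increase (%)", pct_increase(base.avg_repair_iterations, cur.avg_repair_iterations), t.max_repair_increase_pct),
        ("provider failure increase (pp)", (cur.provider_failure_rate - base.provider_failure_rate) * 100.0, t.max_provider_failure_increase_pp),
    ];
    let mut status = RegressionStatus::Pass;
    let mut messages = Vec::new();
    for (name, delta, limit) in checks {
        if delta > limit {
            status = RegressionStatus::Regression;
            messages.push(format!("REGRESSION: {} is {:.1}, limit is {:.1}", name, delta, limit));
        } else if delta > limit / 2.0 {
            if status == RegressionStatus::Pass {
                status = RegressionStatus::Warning;
            }
            messages.push(format!("WARNING: {} is {:.1}, limit is {:.1}", name, delta, limit));
        }
    }
    (status, messages)
}
```

Note the two kinds of delta: pass rate and provider failure rate are compared in absolute percentage points, while cost and repair iterations are compared as relative percentage changes. A CI gate would fail the check on Regression and could surface Warning messages in the PR comment without blocking.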

EU AI Act Article 50 Compliance

This framework explicitly supports Article 50 transparency obligations for hosted Tandem services:

  • Documented AI system: The framework demonstrates how AI components are systematically tested
  • Quality assurance: Automated gates and metrics prove ongoing quality practices
  • Failure categorization: 30+ failure types enable post-mortem root-cause analysis
  • Performance tracking: Metrics track AI quality before, during, and after deployment
  • Regression prevention: Automated safeguards block degradations from reaching users
  • Audit trail: All test results are timestamped and logged for compliance audits

Users and auditors can request detailed evaluation metrics, regression reports, and failure analysis logs via support channels.


Security

  • Authorization fix for approval gates: Slack, Discord, and Telegram approval interactions now fail closed unless the acting user resolves through the configured channel allowlist before approval, rework, or cancel decisions are processed.

  • TOCTOU race condition fix in automation run cache: Automation run state reloads now detect concurrent in-memory updates before accepting disk-loaded state, preventing stale cache loads from overwriting gate decisions or duplicating execution.

  • Path traversal protection for automation identifiers: Automation definition and run-history paths now sanitize identifier-derived filenames and verify resolved paths stay inside their intended state roots.

  • Dedup TTL for webhook replay prevention: Discord and Slack interaction deduplication now uses a bounded retry window, reducing stale replay risk while preserving normal platform retry handling.

  • File permission validation on startup: Startup now warns when sensitive state files have overly broad Unix permissions so operators can tighten local storage access.

  • Discord modal identifier validation: Discord rework modal submissions now reject malformed or incomplete identifiers before any gate decision is dispatched.

  • Telegram dedup TTL implementation: Telegram approval callbacks now use the same retry-window deduplication model as Discord and Slack, reducing stale callback replay risk.

  • User ID extraction rejects instead of defaulting: Channel approval handlers now reject malformed requests without a resolvable acting user instead of assigning a placeholder identity.

  • Reason field size validation: Discord rework feedback is now bounded server-side before being stored with gate decisions.

  • Error message information disclosure prevention: Public channel rejection messages now use generic denial text while retaining detailed audit logs for operators.

  • JWT structure and algorithm validation: Codex identity token parsing now validates token shape, header presence, allowed algorithm behavior, and signature encoding before processing claims.

  • JSON merge recursion depth limit: Provider configuration merging now enforces a maximum nesting depth to avoid stack exhaustion on deeply nested input.

  • CODEX_HOME path validation: Codex CLI home resolution now rejects unsafe or system-sensitive paths and falls back to the default home directory with a warning.

  • JWT token expiration validation: Codex identity resolution now rejects tokens without valid expiration claims and bounds-checks expiration timestamps before time arithmetic.

  • Approval card delivery fan-out: Slack, Discord, and Telegram channel adapters now support native interactive card sends. Approval requests can render as Block Kit messages, Discord embeds with components, or Telegram inline-keyboard messages instead of plain text fallbacks.

  • Slack approval card lifecycle updates: The server records delivered approval message handles in approval_message_map.json and best-effort edits Slack approval cards after approve, rework, or cancel decisions so stale buttons disappear and operators see the final decision inline.

  • Shared automation gate state helpers: Automation V2 gate pause and decision mutations now route through shared state helpers. The executor uses pause_automation_run_for_gate when an approval node blocks downstream work, and the gate-decision HTTP handler uses apply_automation_gate_decision for approve/rework/cancel state transitions, preserving the existing race-safe one-winner behavior.

  • Per-step approval override controls: Workflow edit prompts now let operators keep default approval, set conditional auto-approval metadata, or skip approval for an individual step with confirmation. The saved node metadata drives the compiler's existing approval-skip hook and clears stale injected gates on skipped steps.

  • Telegram approval rework completion: Telegram approval cards now prefer persisted opaque callback IDs so long run/node identifiers do not rely on unsafe truncation. Rework button taps send a force-reply prompt, capture the operator's next valid reply for that chat/user, and dispatch it as a rework gate decision with feedback.

  • Threaded approval status replies: Slack, Discord, and Telegram adapters now share a thread-reply primitive. Approval decisions update the original card and post a short status reply into the stored native thread/topic target when one is available.

  • Channel command capability tiers: Slash commands now carry read/act/approve/reconfigure tiers, and dispatcher execution checks the required tier against the channel security profile. Read contexts can inspect status without gaining approval or reconfiguration powers.

  • Persisted channel user capabilities: Tandem now has channel_user_capabilities.json state for explicit per-channel user capability assignments. Missing users fall back to the channel profile tier until enrollment binds them to a higher tier.

  • Channel enrollment pairing codes: POST /channels/enroll can issue a short-lived pairing code and confirm it out-of-band to bind a Slack, Discord, or Telegram user ID to a persisted capability tier. Approval button handlers now check the resolved user's tier and require Approve or higher before deciding a gate.

  • Channel outbound redaction: Dispatcher replies now pass through a shared redaction filter before Slack, Discord, or Telegram sends. The filter replaces common secret patterns, private-key markers, JWTs, and absolute paths outside the workspace root while preserving markdown structure; deployments can add regexes with TANDEM_CHANNEL_REDACTION_PATTERNS_FILE.

  • Per-user channel rate limiting: Tandem now applies per-user token buckets to channel-origin prompts and approval decisions. Prompts default to 10/minute and decisions to 30/minute. Limits are keyed by (channel, user_id), and profile-specific env overrides are supported. Rejected requests return 429 Too Many Requests with a Retry-After header.

  • Workspace pinning for channel sessions: Channel sessions now carry a pinned workspace boundary. New channel-created sessions pin to the server workspace, enrollment records can preserve an explicit pinned_workspace_id, and file tools are denied with ToolDenied { reason: WorkspaceScope } if a channel session tries to read or write outside the pinned workspace.

  • Streaming audit export: GET /audit/stream now exposes an admin-gated newline-delimited JSON feed for external SIEM-style consumers. The stream normalizes approval decisions, tool execution ledger records, and channel capability changes into records with actor, command, workspace, tool call, result, timestamp, and channel fields where available.

  • Step-up confirmation for channel reconfiguration: Reconfigure-tier slash commands now require a fresh second-surface confirmation before execution. The dispatcher blocks /providers, /model, /schedule, /automations, and /config with a "step-up required" response unless the chat message carries a desktop-issued PIN from the last 5 minutes. The PIN token is removed before command parsing so it is not treated as a model id, schedule prompt, or config argument.

  • Dispatcher baseline cleanup: Channel dispatcher tests now match the registry-driven help output and concrete operator tool allowlist behavior, keeping the approval-channel test suite aligned with the current dispatcher contract.
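
For readers unfamiliar with the token-bucket approach mentioned in the per-user channel rate limiting item, the following Rust sketch shows the general technique. The (channel, user_id) key and the 10/minute prompt default come from this release; the RateLimiter type, its methods, and the choice to pass the clock in as a parameter are hypothetical and not Tandem's actual code.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// State for one (channel, user_id) pair.
struct Bucket {
    tokens: f64,
    last_refill: Instant,
}

/// Hypothetical per-user token-bucket limiter.
pub struct RateLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<(String, String), Bucket>,
}

impl RateLimiter {
    /// Allow `limit` requests per minute per key, with bursts up to `limit`.
    pub fn per_minute(limit: u32) -> Self {
        assert!(limit > 0, "limit must be positive");
        RateLimiter {
            capacity: limit as f64,
            refill_per_sec: limit as f64 / 60.0,
            buckets: HashMap::new(),
        }
    }

    /// Returns Ok(()) if the request is allowed. Otherwise returns the time
    /// until one token is available, which a server could send back as the
    /// Retry-After value of a 429 response.
    pub fn check(&mut self, channel: &str, user_id: &str, now: Instant) -> Result<(), Duration> {
        let capacity = self.capacity;
        let rate = self.refill_per_sec;
        let bucket = self
            .buckets
            .entry((channel.to_string(), user_id.to_string()))
            .or_insert(Bucket { tokens: capacity, last_refill: now });

        // Refill in proportion to elapsed time, capped at the bucket capacity.
        let elapsed = now.saturating_duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * rate).min(capacity);
        bucket.last_refill = now;

        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            Ok(())
        } else {
            Err(Duration::from_secs_f64((1.0 - bucket.tokens) / rate))
        }
    }
}
```

Taking `now` as an argument keeps the limiter deterministic under test. With `RateLimiter::per_minute(10)`, the first ten calls for a given user at the same instant succeed, the eleventh is rejected with a wait of roughly six seconds, and other users are unaffected because each key has its own bucket.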

Full Changelog: v0.5.5...v0.5.6