
[aw-failures] [aw] Auto-Triage Issues: agent recurrently aborts with permission-denied on /tmp/gh-aw/agent script writes #33560

@github-actions


Summary

The Auto-Triage Issues workflow has produced two consecutive failed runs in the last hour. In both, the agent tried to create or run a Python processing script under /tmp/gh-aw/agent/ and hit "Permission denied and could not request permission from user"; the most recent run then fell back to a noop safe output. No labels are applied on these runs, so the workflow's purpose is silently lost.

Prior tracking issue #33422 was auto-closed 8 minutes after the next failure of the same family occurred. The symptom is not resolved; only the tracking issue rotated.

Affected runs (last 6h)

| Run ID | Workflow | Duration | Engine output |
| --- | --- | --- | --- |
| §26165337753 | Auto-Triage Issues | 8.2m | Agent attempted mkdir -p /tmp/gh-aw/agent && cat > /tmp/gh-aw/agent/process_issues.py <<'PY' ... → permission denied → graceful noop |
| §26135471197 | Auto-Triage Issues | 33s | Agent attempted mkdir -p /tmp/gh-aw/agent → permission denied → engine terminated (tracked at #33422) |

Root cause

The agent's allow-list for Auto-Triage Issues is restricted to bash: ["jq *", "cat *"] (see .github/workflows/auto-triage-issues.md:35-37), expanded by the harness to a small set of read-only shell verbs plus safeoutputs:*. Specifically missing: mkdir, python3, and the ability to spawn arbitrary scripts.

The prompt body encourages the agent to "Fetch ALL open issues without any labels ... do not limit to a fixed count" while the pre-step only fetches the first 30 (auto-triage-issues.md:44). The agent therefore reaches for a Python script to paginate and process — which the sandbox correctly denies.
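To illustrate why the attempted commands are rejected, here is a minimal sketch of glob-style allow-list matching. It assumes the two patterns quoted above are matched against the whole command line; the harness's real matching rules are not shown in this report and probably differ (this sketch does not model redirection, compound commands, or the additional verbs the harness expands to). The name is_allowed is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns quoted in the report. The matching rule below is an assumption,
# not the harness's actual implementation.
ALLOWED_BASH = ["jq *", "cat *"]


def is_allowed(command: str, patterns=ALLOWED_BASH) -> bool:
    """Return True if the whole command line matches any allow-list glob."""
    command = command.strip()
    return any(fnmatchcase(command, pattern) for pattern in patterns)


attempts = [
    "jq length /tmp/gh-aw/agent/unlabeled-issues.json",
    "mkdir -p /tmp/gh-aw/agent",
    "python3 /tmp/gh-aw/agent/process_issues.py",
]
for cmd in attempts:
    print("allow" if is_allowed(cmd) else "deny", "|", cmd)
```

Under this assumption only the jq call passes, which matches the observed behavior: anything starting with mkdir or python3 has no matching pattern.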

Audit-diff evidence

audit-diff vs prior failure (26135471197 → 26165337753)
  • Firewall: 0 new/removed domains, 0 anomalies — no network drift.
  • Same engine (copilot, gpt-5-mini), same workflow, same failure surface.
  • Token usage delta is purely due to how far the agent got before being blocked.

Proposed remediation (pick one)

  1. Tighten the prompt (recommended, smallest blast radius): explicitly instruct the agent to use only gh, jq, and safeoutputs add_labels — and to operate purely on the pre-staged /tmp/gh-aw/agent/unlabeled-issues.json. Remove the "fetch ALL" wording or rework it to read pages from the pre-staged file.
  2. Make the pre-step authoritative: expand the Fetch unlabeled issues step to paginate fully (and respect per_page), so the agent never needs to reach for scripting to scale up.
  3. (Avoid unless 1+2 fail) Add shell(mkdir) and a constrained Python tool to the allow-list. This widens the sandbox and should be a last resort.
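For option 2, the pagination logic could look roughly like the sketch below. It is not the workflow's actual pre-step: fetch_page is a hypothetical callable standing in for whatever the step uses to request one page of unlabeled issues, and max_pages is an assumed safety cap. The loop stops at the first page shorter than per_page.

```python
from typing import Callable, Dict, List

Issue = Dict[str, object]


def fetch_all_unlabeled(
    fetch_page: Callable[[int, int], List[Issue]],  # hypothetical page fetcher
    per_page: int = 100,
    max_pages: int = 50,  # assumed safety cap, not taken from the workflow
) -> List[Issue]:
    """Collect every page until a short (or empty) page signals the end."""
    issues: List[Issue] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        issues.extend(batch)
        if len(batch) < per_page:
            break
    return issues


# Stand-in data source so the sketch runs without network access.
_FAKE_ISSUES = [{"number": n} for n in range(1, 76)]


def fake_fetch_page(page: int, per_page: int) -> List[Issue]:
    start = (page - 1) * per_page
    return _FAKE_ISSUES[start:start + per_page]


all_issues = fetch_all_unlabeled(fake_fetch_page, per_page=30)
print(len(all_issues))  # 75: two full pages of 30, then a short page of 15
```

The combined list would then be written to the pre-staged file, so the agent only needs jq and cat to read it.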

Success criteria

  • Two consecutive scheduled runs of Auto-Triage Issues complete with at least one add_labels safe-output call (or a justified noop where no unlabeled issues exist).
  • Neither run's agent-stdio.log contains "Permission denied and could not request permission from user".
  • Auto-tracker issues for Auto-Triage Issues stop being re-opened within 6h of being closed.
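The first two criteria could be checked per run with something like the sketch below. The inputs are assumptions: stdio_log is the text of agent-stdio.log, output_types is a list of safe-output type names already extracted from the run (the schema of agent_output.json is not shown in this report), and unlabeled_count is the number of unlabeled issues at run time. The function name is hypothetical.

```python
from typing import List

# Error text quoted in this report.
DENIED_MARKER = "Permission denied and could not request permission from user"


def run_meets_criteria(
    stdio_log: str,
    output_types: List[str],  # assumed: safe-output type names for the run
    unlabeled_count: int,     # assumed: unlabeled issues open at run time
) -> bool:
    """Apply the per-run success criteria to one Auto-Triage Issues run."""
    if DENIED_MARKER in stdio_log:
        return False
    if "add_labels" in output_types:
        return True
    # A noop is only justified when there was nothing to label.
    return "noop" in output_types and unlabeled_count == 0


print(run_meets_criteria("labels applied", ["add_labels"], 3))
print(run_meets_criteria("bash: " + DENIED_MARKER, ["noop"], 3))
```

The third criterion (tracker issues not being re-opened within 6h) depends on issue history and is not covered here.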


Generated by 🔍 [aw] Failure Investigator (6h) · ● 12.2M ·

  • expires on May 27, 2026, 1:49 PM UTC

Closing as stale — root cause no longer reproduces

A 6-hour failure investigation (window ending 2026-05-20T19:35Z) found that the symptom this issue tracks — Auto-Triage Issues agent aborting with Permission denied on /tmp/gh-aw/agent script writes — is no longer present in current failed runs.

Evidence from recent Auto-Triage Issues failures

| Run ID | Agent outcome | Permission-denied seen? | Where the job actually fails |
| --- | --- | --- | --- |
| §26184375104 | Graceful noop after gh search returned HTTP 403 | No | Post-agent step "Parse MCP Gateway logs for step summary": ERR_SYSTEM: rpc-messages.jsonl is present but zero bytes |
| §26181990934 | add_labels succeeded (["enhancement","workflows"] applied to #33609) | No | Same MCP telemetry post-step |

Why the original symptom is gone
  • No mkdir, python3, or process_issues.py attempts appear in either agent-stdio.log.
  • agent_output.json for run 26181990934 shows a real add_labels action (not a permission-denied fallback).
  • The remediation noted in the original report ("tighten the prompt to use only gh, jq, and safeoutputs add_labels on the pre-staged file") appears to have landed.

A different failure now blocks Auto-Triage Issues

The MCP telemetry capture failure that now fails the agent job is a new, separate regression affecting multiple workflows (Auto-Triage Issues, Contribution Check). It is being tracked in a fresh investigation report — see the parent issue created in this run.
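The "present but zero bytes" condition reported by the post-agent step can be distinguished from a missing file with a check along these lines. This is a sketch only: telemetry_state is a hypothetical helper, and the actual step's logic and the real location of rpc-messages.jsonl are not shown in this report.

```python
import os
import tempfile


def telemetry_state(path: str) -> str:
    """Classify a telemetry file as 'missing', 'empty', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    return "empty" if os.path.getsize(path) == 0 else "ok"


# Demo in a throwaway directory; the real file location is not assumed.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "rpc-messages.jsonl")
    print(telemetry_state(log_path))   # missing
    open(log_path, "w").close()        # create a zero-byte file
    print(telemetry_state(log_path))   # empty
    with open(log_path, "w") as fh:
        fh.write('{"jsonrpc": "2.0"}\n')
    print(telemetry_state(log_path))   # ok
```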


Generated by 🔍 [aw] Failure Investigator (6h) · ● 27.2M ·
