compute_text step strips all non-GitHub URLs from issue/PR/discussion bodies before the agent sees them #27638

@corygehr

Description

Summary

In v0.69.0, the Compute current body text step (id: sanitized, runs compute_text.cjs) redacts every URL in the triggering event's title/body whose hostname isn't on a hardcoded default allow-list — because the compiler never passes the workflow's allowed-domain configuration to that step. As a result, any URL a user pastes into an issue body (news articles, market pages, documentation, etc.) arrives at the agent as <domain>/redacted, even when:

  • the domain is listed under network.allowed in the workflow frontmatter, and
  • the domain is listed under safe-outputs.allowed-domains, and
  • the domain appears in the GH_AW_ALLOWED_DOMAINS env var that the compiler does wire up on the downstream output-ingest step.

This effectively breaks any workflow whose whole point is to ingest a user-supplied URL (triage, research, summarization, "explain this article", etc.).

Impact

  • Any workflow that trusts a URL from an issues, issue_comment, pull_request, discussion, or discussion_comment event body.
  • Users pasting URLs from Reuters, AP, BBC, Bloomberg, or any other legitimate, network-allow-listed source will see the agent fetch <that-domain>/redacted, which 404s. The agent has no signal that the URL was altered, so it often concludes the user submitted a broken link and closes/escalates incorrectly.
  • Workarounds require hand-editing the compiled .lock.yml, which is not durable across recompiles.

Expected Behavior

The compute_text / sanitized step should receive the same GH_AW_ALLOWED_DOMAINS value that the compiler already computes for the output-ingest step — i.e. the union of the engine/network base set, network.allowed, and safe-outputs.allowed-domains. Incoming-text sanitization and outgoing-content sanitization should apply the same allow-list; otherwise the two sides of the pipeline disagree about which domains are "known good."
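For concreteness, the expected union can be sketched as follows. The variable names below are illustrative only (the report names `computeExpandedAllowedDomainsForSanitization` / `computeAllowedDomainsForSanitization` as the real helpers); the domain lists come from this report's own examples:

```javascript
// Illustrative sketch of the union the compiler reportedly already computes
// for the output-ingest step. Names here are hypothetical, not the compiler's
// actual identifiers.
const engineBase = [
  "github.com", "github.io", "githubusercontent.com",
  "githubassets.com", "github.dev", "codespaces.new",
];
const networkAllowed = ["cnn.com"];      // from network.allowed in frontmatter
const safeOutputsAllowed = ["cnn.com"];  // from safe-outputs.allowed-domains

// Union, de-duplicated, then emitted on BOTH sanitize steps as
// GH_AW_ALLOWED_DOMAINS=<comma-separated list>.
const allowedDomains = [
  ...new Set([...engineBase, ...networkAllowed, ...safeOutputsAllowed]),
];
const envValue = allowedDomains.join(",");
```

The point is that the same `envValue` should reach both the `id: sanitized` step and the output-collection step, so both sides of the pipeline agree on which domains are "known good."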

Actual Behavior

In pkg/workflow/compiler_activation_job_builder.go (the NeedsTextOutput branch, around the step titled Compute current body text), the only env var emitted is GH_AW_ALLOWED_BOTS (when data.Bots is populated). GH_AW_ALLOWED_DOMAINS is not set, so at runtime sanitize_content_core.cjs#buildAllowedDomains falls back to the hardcoded default:

github.com, github.io, githubusercontent.com, githubassets.com, github.dev, codespaces.new

Any URL whose host is not on that list is rewritten to (<sanitized-domain>/redacted) by sanitizeUrlDomains before the text ever reaches the prompt construction step. The redaction is logged to /tmp/gh-aw/redacted-urls.log, but the agent itself has no awareness of which URLs were altered.
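The effective behavior can be modeled with the sketch below. This is a simplified stand-in for illustration, assuming suffix matching against the allow-list, not the actual `sanitize_content_core.cjs` source:

```javascript
// Simplified model of the redaction behavior described above; NOT the real
// sanitize_content_core.cjs implementation.
const DEFAULT_ALLOWED = [
  "github.com", "github.io", "githubusercontent.com",
  "githubassets.com", "github.dev", "codespaces.new",
];

function buildAllowedDomainsSketch(envValue) {
  // The real helper falls back to the hardcoded defaults when
  // GH_AW_ALLOWED_DOMAINS is absent -- which is the bug's trigger.
  return envValue
    ? envValue.split(",").map((d) => d.trim()).filter(Boolean)
    : DEFAULT_ALLOWED;
}

function redactUrlsSketch(text, allowed) {
  // Rewrite any URL whose hostname is not on (or under) an allowed domain.
  return text.replace(/https?:\/\/[^\s)]+/g, (url) => {
    const host = new URL(url).hostname;
    const ok = allowed.some((d) => host === d || host.endsWith("." + d));
    return ok ? url : `(${host}/redacted)`;
  });
}

const stripped = redactUrlsSketch(
  "see https://cnn.com/some-article",
  buildAllowedDomainsSketch(undefined) // env var not wired up
);
// → "see (cnn.com/redacted)"

const kept = redactUrlsSketch(
  "see https://cnn.com/some-article",
  buildAllowedDomainsSketch("cnn.com") // env var threaded through
);
// → "see https://cnn.com/some-article"
```

Under this model, wiring `GH_AW_ALLOWED_DOMAINS` into the step is sufficient for allow-listed URLs to survive, with no change to the sanitizer itself.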

Reproduction

  1. Create a workflow that triggers on issues: [opened, labeled], declares network.allowed including some external domain (e.g. cnn.com), and instructs the agent to fetch the URL from the issue body.
  2. Compile with gh aw compile.
  3. Open an issue whose body contains https://cnn.com/some-article.
  4. Inspect the agent transcript — the URL the agent received is cnn.com/redacted, and any fetch call against it returns 404.
  5. Inspect the compiled .lock.yml — the id: sanitized step has no env: block (or only GH_AW_ALLOWED_BOTS), while the downstream collect_output step does have the full GH_AW_ALLOWED_DOMAINS value. The two are inconsistent.
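A minimal repro workflow's frontmatter might look like the fragment below. The layout is sketched from the fields named in this report (`network.allowed`, `safe-outputs.allowed-domains`), so treat the exact nesting as an approximation rather than canonical gh-aw syntax:

```yaml
on:
  issues:
    types: [opened, labeled]
network:
  allowed:
    - cnn.com
safe-outputs:
  allowed-domains:
    - cnn.com
```

With this configuration, the author's reasonable expectation is that `https://cnn.com/...` URLs in the issue body reach the agent intact; in v0.69.0 they do not.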

Suggested Fix

In the NeedsTextOutput branch of the activation-job builder, emit the same GH_AW_ALLOWED_DOMAINS env var that generateOutputCollectionStep already emits (the value produced by computeExpandedAllowedDomainsForSanitization / computeAllowedDomainsForSanitization). That one-line change reuses logic that already exists and brings the incoming-text sanitizer into line with the outgoing-content sanitizer.

Optional follow-ups that would also be nice:

  • Surface a warning in compute_text.cjs when redactions occur (e.g. one-line summary to the step summary), so authors notice when user-supplied URLs are being stripped.
  • Document the incoming-text sanitizer and its relationship to network.allowed and safe-outputs.allowed-domains — today the only mention of sanitization in the docs is on the output side, so authors reasonably assume network.allowed covers "URLs my agent is allowed to read."

Workaround

Until this is fixed, workflows that need to accept user-supplied URLs must either:

  • post-process steps.sanitized.outputs.text in a custom step (fragile), or
  • instruct users to put the URL somewhere the sanitizer doesn't touch — e.g. as a label, a custom field, or a structured issue-form dropdown — which defeats the point of a free-form URL field.

Neither is a good long-term answer; the right fix is to thread the allow-list through to the sanitize step.

Environment

  • gh-aw v0.69.0
  • Runner: ubuntu:24.04
  • Engine: Copilot (reproduces regardless of engine — the sanitizer runs before the engine dispatch)
