feat: report agent failure in OTEL conclusion span #24650
…span Agent-Logs-Url: https://github.com/github/gh-aw/sessions/34d6377a-59ca-457b-a09e-529a6c43d6d3 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
- `buildOTLPPayload` now accepts optional `statusCode` and `statusMessage` parameters (defaults to STATUS_CODE_OK/1)
- `sendJobConclusionSpan` reads `GH_AW_AGENT_CONCLUSION` env var and:
  - Adds `gh-aw.agent.conclusion` span attribute with the conclusion value
  - Sets span status to STATUS_CODE_ERROR (code 2) when conclusion is "failure" or "timed_out"
  - Includes status message "agent failure" / "agent timed_out"
- 5 new tests added covering all conclusion scenarios
Pull request overview
This PR updates OTLP span emission so the job conclusion span reflects agent failures/timeouts in OpenTelemetry (status code + message) instead of always reporting OK.
Changes:
- Extend OTLP payload builder to accept an explicit span status code/message.
- Set conclusion span status based on `GH_AW_AGENT_CONCLUSION`, and add `gh-aw.agent.conclusion` attribute.
- Add tests for all agent conclusion scenarios.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/send_otlp_span.cjs | Adds dynamic OTLP span status and includes agent conclusion as a span attribute. |
| actions/setup/js/action_conclusion_otlp.cjs | Documents the new GH_AW_AGENT_CONCLUSION env var behavior. |
| actions/setup/js/action_otlp.test.cjs | Adds tests validating OTLP status/attributes for each conclusion outcome. |
| .github/workflows/hourly-ci-cleaner.lock.yml | Updates prompt/frontmatter extraction and threat-detection CLI debug logging configuration. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 3
| const code = typeof statusCode === "number" ? statusCode : 1; // STATUS_CODE_OK | ||
| /** @type {{ code: number, message?: string }} */ | ||
| const status = { code }; | ||
| if (statusMessage) { |
buildOTLPPayload adds status.message whenever statusMessage is truthy, but the new JSDoc says the message is only included when statusCode is 2 (ERROR). Either gate status.message on code === 2 (and optionally ignore/clear message for OK/UNSET) or update the JSDoc to match the actual behavior.
| if (statusMessage) { | |
| if (code === 2 && statusMessage) { |
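The gated variant suggested here can be sketched in isolation. `buildSpanStatus` is a hypothetical helper written for this illustration, not a function in `send_otlp_span.cjs`; the values 1 and 2 are the OK and ERROR codes referenced in this PR.

```javascript
// Hypothetical helper, for illustration only (not part of send_otlp_span.cjs).
const STATUS_CODE_OK = 1;
const STATUS_CODE_ERROR = 2;

/**
 * Build an OTLP span status object.
 * The message is attached only for ERROR, matching the JSDoc under review.
 * @param {number} [statusCode]
 * @param {string} [statusMessage]
 * @returns {{ code: number, message?: string }}
 */
function buildSpanStatus(statusCode, statusMessage) {
  // Fall back to OK when no numeric code is supplied.
  const code = typeof statusCode === "number" ? statusCode : STATUS_CODE_OK;
  /** @type {{ code: number, message?: string }} */
  const status = { code };
  // Gate the message on ERROR so OK spans never carry a stray message.
  if (code === STATUS_CODE_ERROR && statusMessage) {
    status.message = statusMessage;
  }
  return status;
}

console.log(buildSpanStatus(STATUS_CODE_ERROR, "agent failure"));
console.log(buildSpanStatus(STATUS_CODE_OK, "ignored for OK spans"));
```

With this gate, a caller that passes a message alongside an OK code gets a status without `message`, which is the behavior the JSDoc describes. The alternative is to keep the truthiness check and reword the JSDoc.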
| AGENT_CONTENT="$(awk 'NR==1 && /^---$/ { in_frontmatter=1; next } in_frontmatter && /^---$/ { in_frontmatter=0; next } !in_frontmatter { print }' "${GITHUB_WORKSPACE}/.github/agents/ci-cleaner.agent.md")" | ||
| # Combine agent content with prompt | ||
| PROMPT_TEXT="$(printf '%s\n\n%s' "$AGENT_CONTENT" "$(cat /tmp/gh-aw/aw-prompts/prompt.txt)")" | ||
| AGENT_CONTENT="$(awk 'BEGIN{skip=1} /^---$/{if(skip){skip=0;next}else{skip=1;next}} !skip' "${GITHUB_WORKSPACE}/.github/agents/ci-cleaner.agent.md")" |
The updated awk frontmatter-stripping logic appears inverted: with BEGIN{skip=1} it prints the YAML frontmatter (lines between the first and second ---) and skips the actual markdown body. For .github/agents/ci-cleaner.agent.md (which starts with a frontmatter block), this will result in the prompt missing the agent instructions. Initialize skip so that content is printed after the closing --- (or revert to the prior frontmatter logic).
| AGENT_CONTENT="$(awk 'BEGIN{skip=1} /^---$/{if(skip){skip=0;next}else{skip=1;next}} !skip' "${GITHUB_WORKSPACE}/.github/agents/ci-cleaner.agent.md")" | |
| AGENT_CONTENT="$(awk 'BEGIN{skip=0} /^---$/{if(skip){skip=0;next}else{skip=1;next}} !skip' "${GITHUB_WORKSPACE}/.github/agents/ci-cleaner.agent.md")" |
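To make the difference between the awk variants concrete, here is a minimal JavaScript sketch of the same line-by-line logic. Both functions are hypothetical ports written for this illustration and do not exist in the repository; the sample document is made up. `stripToggle` mirrors the toggling one-liner with a configurable initial `skip`, and `stripLeading` mirrors the earlier `NR==1` logic.

```javascript
// Hypothetical ports of the awk one-liners, for illustration only.

/** Mirrors: BEGIN{skip=N} /^---$/{toggle skip; next} !skip */
function stripToggle(text, initialSkip) {
  let skip = initialSkip;
  const out = [];
  for (const line of text.split("\n")) {
    if (line === "---") {
      skip = !skip; // every delimiter flips the state and is never printed
      continue;
    }
    if (!skip) out.push(line);
  }
  return out.join("\n");
}

/** Mirrors the earlier logic: only a `---` on line 1 opens frontmatter. */
function stripLeading(text) {
  let inFrontmatter = false;
  const out = [];
  text.split("\n").forEach((line, index) => {
    if (index === 0 && line === "---") {
      inFrontmatter = true;
      return;
    }
    if (inFrontmatter && line === "---") {
      inFrontmatter = false;
      return;
    }
    if (!inFrontmatter) out.push(line);
  });
  return out.join("\n");
}

// Made-up sample input: a frontmatter block followed by a markdown body.
const doc = "---\nname: ci-cleaner\n---\n# Instructions\nFix CI.";

console.log(stripToggle(doc, true)); // skip=1 start: keeps the frontmatter, drops the body
console.log(stripToggle(doc, false)); // skip=0 start: keeps the body
console.log(stripLeading(doc)); // earlier logic: keeps the body
```

One caveat on the `skip=0` variant: because it toggles on every `---` line, a horizontal rule later in the markdown body would start skipping again, whereas the earlier `NR==1` logic prints such lines. If agent files may contain `---` in the body, reverting to the prior logic looks like the safer of the two options.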
| # shellcheck disable=SC1003 | ||
| sudo -E awf --container-workdir "${GITHUB_WORKSPACE}" --mount "${RUNNER_TEMP}/gh-aw:${RUNNER_TEMP}/gh-aw:ro" --mount "${RUNNER_TEMP}/gh-aw:/host${RUNNER_TEMP}/gh-aw:ro" --tty --env-all --exclude-env ANTHROPIC_API_KEY --allow-domains '*.githubusercontent.com,anthropic.com,api.anthropic.com,api.github.com,api.snapcraft.io,archive.ubuntu.com,azure.archive.ubuntu.com,cdn.playwright.dev,codeload.github.com,crl.geotrust.com,crl.globalsign.com,crl.identrust.com,crl.sectigo.com,crl.thawte.com,crl.usertrust.com,crl.verisign.com,crl3.digicert.com,crl4.digicert.com,crls.ssl.com,files.pythonhosted.org,ghcr.io,github-cloud.githubusercontent.com,github-cloud.s3.amazonaws.com,github.com,host.docker.internal,json-schema.org,json.schemastore.org,keyserver.ubuntu.com,lfs.github.com,objects.githubusercontent.com,ocsp.digicert.com,ocsp.geotrust.com,ocsp.globalsign.com,ocsp.identrust.com,ocsp.sectigo.com,ocsp.ssl.com,ocsp.thawte.com,ocsp.usertrust.com,ocsp.verisign.com,packagecloud.io,packages.cloud.google.com,packages.microsoft.com,playwright.download.prss.microsoft.com,ppa.launchpad.net,pypi.org,raw.githubusercontent.com,registry.npmjs.org,s.symcb.com,s.symcd.com,security.ubuntu.com,sentry.io,statsig.anthropic.com,ts-crl.ws.symantec.com,ts-ocsp.ws.symantec.com' --log-level info --proxy-logs-dir /tmp/gh-aw/sandbox/firewall/logs --audit-dir /tmp/gh-aw/sandbox/firewall/audit --enable-host-access --image-tag 0.25.13 --skip-pull --enable-api-proxy \ | ||
| -- /bin/bash -c 'export PATH="$(find /opt/hostedtoolcache -maxdepth 4 -type d -name bin 2>/dev/null | tr '\''\n'\'' '\'':'\'')$PATH"; [ -n "$GOROOT" ] && export PATH="$GOROOT/bin:$PATH" || true && claude --print --disable-slash-commands --no-chrome --allowed-tools Bash,BashOutput,ExitPlanMode,Glob,Grep,KillBash,LS,NotebookRead,Read,Task,TodoWrite --debug-file /tmp/gh-aw/threat-detection/detection.debug.log --verbose --permission-mode bypassPermissions --output-format stream-json "$(cat /tmp/gh-aw/aw-prompts/prompt.txt)"${GH_AW_MODEL_DETECTION_CLAUDE:+ --model "$GH_AW_MODEL_DETECTION_CLAUDE"}' 2>&1 | tee -a /tmp/gh-aw/threat-detection/detection.log | ||
| -- /bin/bash -c 'export PATH="$(find /opt/hostedtoolcache -maxdepth 4 -type d -name bin 2>/dev/null | tr '\''\n'\'' '\'':'\'')$PATH"; [ -n "$GOROOT" ] && export PATH="$GOROOT/bin:$PATH" || true && claude --print --disable-slash-commands --no-chrome --allowed-tools Bash,BashOutput,ExitPlanMode,Glob,Grep,KillBash,LS,NotebookRead,Read,Task,TodoWrite --debug-file /tmp/gh-aw/threat-detection/detection.log --verbose --permission-mode bypassPermissions --output-format stream-json "$(cat /tmp/gh-aw/aw-prompts/prompt.txt)"${GH_AW_MODEL_DETECTION_CLAUDE:+ --model "$GH_AW_MODEL_DETECTION_CLAUDE"}' 2>&1 | tee -a /tmp/gh-aw/threat-detection/detection.log |
--debug-file is set to /tmp/gh-aw/threat-detection/detection.log, which is also the file being written via tee -a .../detection.log. Having the CLI write its debug output to the same file as the streamed stdout/stderr can interleave/corrupt logs and make debugging/parsing unreliable. Consider restoring a separate debug file path (e.g., detection.debug.log) or removing --debug-file if not needed.
| -- /bin/bash -c 'export PATH="$(find /opt/hostedtoolcache -maxdepth 4 -type d -name bin 2>/dev/null | tr '\''\n'\'' '\'':'\'')$PATH"; [ -n "$GOROOT" ] && export PATH="$GOROOT/bin:$PATH" || true && claude --print --disable-slash-commands --no-chrome --allowed-tools Bash,BashOutput,ExitPlanMode,Glob,Grep,KillBash,LS,NotebookRead,Read,Task,TodoWrite --debug-file /tmp/gh-aw/threat-detection/detection.log --verbose --permission-mode bypassPermissions --output-format stream-json "$(cat /tmp/gh-aw/aw-prompts/prompt.txt)"${GH_AW_MODEL_DETECTION_CLAUDE:+ --model "$GH_AW_MODEL_DETECTION_CLAUDE"}' 2>&1 | tee -a /tmp/gh-aw/threat-detection/detection.log | |
| -- /bin/bash -c 'export PATH="$(find /opt/hostedtoolcache -maxdepth 4 -type d -name bin 2>/dev/null | tr '\''\n'\'' '\'':'\'')$PATH"; [ -n "$GOROOT" ] && export PATH="$GOROOT/bin:$PATH" || true && claude --print --disable-slash-commands --no-chrome --allowed-tools Bash,BashOutput,ExitPlanMode,Glob,Grep,KillBash,LS,NotebookRead,Read,Task,TodoWrite --debug-file /tmp/gh-aw/threat-detection/detection.debug.log --verbose --permission-mode bypassPermissions --output-format stream-json "$(cat /tmp/gh-aw/aw-prompts/prompt.txt)"${GH_AW_MODEL_DETECTION_CLAUDE:+ --model "$GH_AW_MODEL_DETECTION_CLAUDE"}' 2>&1 | tee -a /tmp/gh-aw/threat-detection/detection.log |
Summary
Reports the agent job conclusion status through OpenTelemetry when OTLP is enabled. Previously, OTEL conclusion spans always recorded `STATUS_CODE_OK` regardless of whether the agent job succeeded or failed.
Changes
actions/setup/js/send_otlp_span.cjs
- `buildOTLPPayload`: Accepts optional `statusCode` (defaults to 1/OK) and `statusMessage` parameters, allowing callers to set the OTLP span status dynamically.
- `sendJobConclusionSpan`: Now reads `GH_AW_AGENT_CONCLUSION` (already set in the conclusion job environment) and:
  - Adds a `gh-aw.agent.conclusion` span attribute with the raw conclusion value (`"success"`, `"failure"`, `"timed_out"`, `"cancelled"`, `"skipped"`)
  - Sets the span status to `STATUS_CODE_ERROR` (code 2) when the conclusion is `"failure"` or `"timed_out"`
  - Includes a status message (`"agent failure"` / `"agent timed_out"`) in those error cases
actions/setup/js/action_conclusion_otlp.cjs
- Documents the `GH_AW_AGENT_CONCLUSION` environment variable.
actions/setup/js/action_otlp.test.cjs
- `"failure"` → STATUS_CODE_ERROR + `gh-aw.agent.conclusion` attribute
- `"timed_out"` → STATUS_CODE_ERROR + `gh-aw.agent.conclusion` attribute
- `"success"` → STATUS_CODE_OK + `gh-aw.agent.conclusion` attribute
- `"cancelled"` → STATUS_CODE_OK (cancelled is not an error) + attribute
- … `gh-aw.agent.conclusion` attribute
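The conclusion-to-status mapping described in this summary can be sketched roughly as follows. This is an illustrative sketch, not the code in `send_otlp_span.cjs`: `conclusionToSpanFields` is a hypothetical name, the attribute shape follows the OTLP JSON key/value encoding, and the handling of an unset variable (no attribute added) is an assumption.

```javascript
// Hypothetical sketch of the mapping described above; names are illustrative.
const STATUS_CODE_OK = 1;
const STATUS_CODE_ERROR = 2;

// Only these conclusions are treated as errors; "cancelled" and "skipped" stay OK.
const ERROR_CONCLUSIONS = new Set(["failure", "timed_out"]);

/**
 * Derive span attributes and status from an agent conclusion string.
 * @param {string} conclusion Raw value of GH_AW_AGENT_CONCLUSION ("" when unset).
 */
function conclusionToSpanFields(conclusion) {
  const attributes = [];
  // Assumption: an empty or unset conclusion adds no attribute at all.
  if (conclusion) {
    attributes.push({
      key: "gh-aw.agent.conclusion",
      value: { stringValue: conclusion },
    });
  }
  const isError = ERROR_CONCLUSIONS.has(conclusion);
  return {
    attributes,
    statusCode: isError ? STATUS_CODE_ERROR : STATUS_CODE_OK,
    // Message is only meaningful for errors: "agent failure" / "agent timed_out".
    statusMessage: isError ? `agent ${conclusion}` : undefined,
  };
}

const fields = conclusionToSpanFields(process.env.GH_AW_AGENT_CONCLUSION || "");
console.log(fields.statusCode, fields.statusMessage);
```

Keeping the error set explicit makes the "cancelled is not an error" decision visible in one place, so adding or removing an error conclusion later is a one-line change.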