
fix: count caps, trim-by-date, upload-asset limit, & timeout#28916

Closed
Copilot wants to merge 2 commits into main from copilot/deep-dive-agentic-optimization-kit

Conversation

Contributor

Copilot AI commented Apr 28, 2026

Eight correctness bugs across agentic-optimization-kit.md, copilot-token-optimizer.md, and agentic-observability-kit.md — silent data truncation, a cooldown window that breaks under load, and a Monday schedule collision causing concurrent MCP rate-limit pressure.

agentic-optimization-kit.md

  • timeout 35 → 50 min
  • upload-asset declared (max: 5, allowed-exts: [.png]); Phase 4 uploads 5 charts but gateway limit was absent — silent drop on the 5th
  • Phase 3 count 400 → 1000; 400 runs exhaust in ~6 days on an active repo
  • Hardcoded example in optimization-log entry replaced with <TODAY> / <WORKFLOW_NAME> / <TOTAL_TOKENS> / etc. placeholders — the literal 2026-04-28 / daily-observability-report was today's date and copyable verbatim
  • Trim by date, not count: "keep only last 30 entries" → keep entries newer than 30 days; count-based trimming silently evicts entries still inside the 14-day cooldown window
  • Phase timing checkpoints: date logging at phase start/end to get empirical wall-clock data before any future timeout tuning
  • Jaccard — "Jaccard or cosine similarity" → "Jaccard similarity on token sets" for workflow_overlap_score
  • gh api example added for reading workflow source files:

    ```bash
    gh api repos/{owner}/{repo}/contents/.github/workflows/WORKFLOW_NAME.md \
      --jq '.content' | base64 -d
    ```
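The date-window trim can be sketched with jq. The entry shape below (a JSON array with an ISO `date` field) is an invented stand-in for illustration, not the kit's actual log format, and GNU `date -d` is assumed for building the toy input:

```shell
# Toy optimization-log: one entry 5 days old (inside the window), one 45 days
# old (outside). GNU date's -d flag is assumed, as on Linux CI runners.
RECENT=$(date -u -d '5 days ago' +%Y-%m-%d)
OLD=$(date -u -d '45 days ago' +%Y-%m-%d)
LOG=$(jq -n --arg r "$RECENT" --arg o "$OLD" \
  '[{date:$r, note:"inside window"}, {date:$o, note:"outside window"}]')

# Keep entries newer than 30 days instead of "last 30 entries": parse each
# entry date to epoch seconds and compare against a 30-day cutoff.
TRIMMED=$(echo "$LOG" | jq '[ .[]
  | select((.date | strptime("%Y-%m-%d") | mktime) > (now - 30*86400)) ]')
echo "$TRIMMED" | jq -c .
```

A count-based cap can evict a recent entry whenever the log grows quickly; the date predicate guarantees anything written inside the 14-day cooldown window survives the trim.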

copilot-token-optimizer.md

  • jq null guard: map(.tokens) | add returns null on empty input; added // 0 after every add in the pre-aggregation script (AOK already had this fix)
  • Fetch count -c 50 → -c 200; 50 runs cover only ~3 days, biasing target selection
  • Trim by date — same 30-day date-window fix as AOK
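The null propagation is easy to reproduce with a minimal stand-in for the runs JSON (the field names mirror the description above, but the input is invented):

```shell
# With zero matching runs the mapped array is empty, and jq's `add` on an
# empty array yields null, which then poisons arithmetic like `add / length`.
# The `// 0` alternative operator substitutes 0 when the left side is null.
EMPTY='{"runs":[]}'
UNGUARDED=$(echo "$EMPTY" | jq '[.runs[].tokens] | add')
GUARDED=$(echo "$EMPTY" | jq '[.runs[].tokens] | add // 0')
echo "unguarded=$UNGUARDED guarded=$GUARDED"   # unguarded=null guarded=0
```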

agentic-observability-kit.md

  • Schedule weekly on monday → weekly on tuesday — eliminates the concurrent Monday MCP logs call collision with agentic-optimization-kit at zero structural cost

- agentic-optimization-kit: raise timeout 35→50 min
- agentic-optimization-kit: add upload-asset max:5 allowed-exts:[.png] to safe-outputs
- agentic-optimization-kit: Phase 3 logs count 400→1000
- agentic-optimization-kit: replace hardcoded date/workflow in optimization-log example with placeholders
- agentic-optimization-kit: trim optimization-log by 30-day date window instead of 30-entry count
- agentic-optimization-kit: add Phase Timing Checkpoints section
- agentic-optimization-kit: specify Jaccard similarity (not "Jaccard or cosine")
- agentic-optimization-kit: add gh api example for reading workflow source
- copilot-token-optimizer: add // 0 null guard after jq add in pre-aggregation
- copilot-token-optimizer: fetch count -c 50 → -c 200
- copilot-token-optimizer: trim optimization-log by 30-day date window instead of count
- agentic-observability-kit: schedule monday → tuesday to stagger MCP rate-limit load
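The Jaccard choice pinned down above can be sketched in jq; the two token sets are invented examples, and jq's array subtraction supplies the intersection:

```shell
# Jaccard similarity on token sets: |A ∩ B| / |A ∪ B|.
A='["fetch","logs","report"]'
B='["fetch","logs","upload"]'
SCORE=$(jq -n --argjson a "$A" --argjson b "$B" '
  (($a - ($a - $b)) | length) as $inter      # intersection: A minus (A minus B)
  | (($a + $b) | unique | length) as $union  # union of both token sets
  | $inter / $union')
echo "$SCORE"   # 0.5
```

Unlike cosine similarity over embeddings, a plain set measure needs no model call and is deterministic across runs, a reasonable property for workflow_overlap_score.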

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/d5b6bfb2-47bf-4aa2-a24c-223cd4459d6a

Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
@mnkiefer mnkiefer marked this pull request as ready for review April 28, 2026 11:01
Copilot AI review requested due to automatic review settings April 28, 2026 11:01
@mnkiefer mnkiefer changed the title Fix AOK bugs: jq null propagation, count caps, hardcoded example, trim-by-date, upload-asset limit, timeout, and schedule collision fix: count caps, trim-by-date, upload-asset limit, & timeout Apr 28, 2026
@github-actions github-actions Bot mentioned this pull request Apr 28, 2026
Contributor

Copilot AI left a comment


Pull request overview

Fixes multiple correctness and operational issues in the Agentic Optimization/Observability kits and the Copilot token optimizer workflows (null-safe jq aggregation, higher run caps, safer examples/log retention guidance, asset upload limits, timeout tuning, and schedule deconfliction).

Changes:

  • Increased gh aw logs fetch limits and made jq aggregations null-safe (add // 0) for Copilot token optimizer.
  • Updated Agentic Optimization Kit guardrails: timeout, upload-asset constraints, larger Phase 3 count, placeholder-based examples, date-window trimming guidance, and added phase timing checkpoints + workflow-source fetch example.
  • Moved Agentic Observability Kit weekly schedule from Monday to Tuesday to avoid collisions.
| File | Description |
| --- | --- |
| .github/workflows/copilot-token-optimizer.md | Raises audit run cap, hardens jq aggregation, updates optimization-log retention guidance |
| .github/workflows/copilot-token-optimizer.lock.yml | Regenerated workflow lock with updated log cap and jq aggregation changes |
| .github/workflows/agentic-optimization-kit.md | Adds upload-asset limit + timeout increase, doc/guardrail refinements, phase timing guidance, placeholders |
| .github/workflows/agentic-optimization-kit.lock.yml | Regenerated lock reflecting timeout and upload-asset restriction changes |
| .github/workflows/agentic-observability-kit.md | Shifts weekly schedule to Tuesday |
| .github/workflows/agentic-observability-kit.lock.yml | Regenerated lock reflecting new Tuesday cron schedule |

Copilot's findings


  • Files reviewed: 6/6 changed files
  • Comments generated: 4

Comment on lines +226 to +229
To fetch a workflow file, run (replacing `WORKFLOW_NAME` with the actual filename without extension):

```bash
gh api repos/{owner}/{repo}/contents/.github/workflows/WORKFLOW_NAME.md \
```

Copilot AI Apr 28, 2026


The gh api repos/{owner}/{repo}/... example won’t work when copy/pasted unless {owner} and {repo} are also replaced. Consider either explicitly calling out that all placeholders must be replaced, or use an example that derives the repo from context (e.g., repos/$GITHUB_REPOSITORY/...) so only WORKFLOW_NAME needs editing.

Suggested change

Before:

> To fetch a workflow file, run (replacing `WORKFLOW_NAME` with the actual filename without extension):

```bash
gh api repos/{owner}/{repo}/contents/.github/workflows/WORKFLOW_NAME.md \
```

After:

> To fetch a workflow file, run (replacing `WORKFLOW_NAME` with the actual filename without extension; the repository is derived from `$GITHUB_REPOSITORY`):

```bash
gh api "repos/$GITHUB_REPOSITORY/contents/.github/workflows/WORKFLOW_NAME.md" \
```

Comment on lines 83 to 86
```jq
]
| group_by(.workflow_name)
| map({
workflow_name: .[0].workflow_name,
```

Copilot AI Apr 28, 2026


jq's group_by(.workflow_name) requires the input array to be sorted by the grouping key; otherwise runs with the same workflow name that are non-adjacent will end up in different groups and totals/averages will be wrong. Add a sort_by(.workflow_name) (or equivalent) immediately before group_by(.workflow_name) in this pipeline (as done in agentic-optimization-kit.md).

See below for a potential fix:

```jq
          ]
          | sort_by(.workflow_name)
```

run: "set -euo pipefail\nmkdir -p /tmp/gh-aw/token-audit\n\necho \"📥 Downloading Copilot workflow logs (last 7 days)...\"\n\nLOGS_EXIT=0\ngh aw logs \\\n --engine copilot \\\n --start-date -7d \\\n --json \\\n -c 200 \\\n > /tmp/gh-aw/token-audit/all-runs.json || LOGS_EXIT=$?\n\nif [ -s /tmp/gh-aw/token-audit/all-runs.json ]; then\n TOTAL=$(jq '.runs | length' /tmp/gh-aw/token-audit/all-runs.json)\n echo \"✅ Downloaded $TOTAL Copilot workflow runs (last 7 days)\"\n if [ \"$LOGS_EXIT\" -ne 0 ]; then\n echo \"⚠️ gh aw logs exited with code $LOGS_EXIT (partial results — likely API rate limit)\"\n fi\nelse\n echo \"❌ No log data downloaded (exit code $LOGS_EXIT)\"\n echo '{\"runs\":[],\"summary\":{}}' > /tmp/gh-aw/token-audit/all-runs.json\nfi\n"
- name: Pre-aggregate top workflows by token usage
run: "set -euo pipefail\nmkdir -p /tmp/gh-aw/token-audit\n\njq '{\n generated_at: (now | todateiso8601),\n window_days: 7,\n top_workflows: (\n [.runs[]\n | select(.status == \"completed\")\n | {\n workflow_name: .workflow_name,\n tokens: (.token_usage // 0),\n cost: (.estimated_cost // 0),\n turns: (.turns // 0),\n action_minutes: (.action_minutes // 0)\n }\n ]\n | group_by(.workflow_name)\n | map({\n workflow_name: .[0].workflow_name,\n run_count: length,\n total_tokens: (map(.tokens) | add),\n avg_tokens: ((map(.tokens) | add) / length),\n total_cost: (map(.cost) | add),\n total_turns: (map(.turns) | add),\n total_action_minutes: (map(.action_minutes) | add)\n })\n | sort_by(.total_tokens)\n | reverse\n | .[:10]\n )\n}' /tmp/gh-aw/token-audit/all-runs.json > /tmp/gh-aw/token-audit/top-workflows.json\n\necho \"✅ Generated top workflow summary at /tmp/gh-aw/token-audit/top-workflows.json\"\njq '.top_workflows' /tmp/gh-aw/token-audit/top-workflows.json\n"
run: "set -euo pipefail\nmkdir -p /tmp/gh-aw/token-audit\n\njq '{\n generated_at: (now | todateiso8601),\n window_days: 7,\n top_workflows: (\n [.runs[]\n | select(.status == \"completed\")\n | {\n workflow_name: .workflow_name,\n tokens: (.token_usage // 0),\n cost: (.estimated_cost // 0),\n turns: (.turns // 0),\n action_minutes: (.action_minutes // 0)\n }\n ]\n | group_by(.workflow_name)\n | map({\n workflow_name: .[0].workflow_name,\n run_count: length,\n total_tokens: (map(.tokens) | add // 0),\n avg_tokens: ((map(.tokens) | add // 0) / length),\n total_cost: (map(.cost) | add // 0),\n total_turns: (map(.turns) | add // 0),\n total_action_minutes: (map(.action_minutes) | add // 0)\n })\n | sort_by(.total_tokens)\n | reverse\n | .[:10]\n )\n}' /tmp/gh-aw/token-audit/all-runs.json > /tmp/gh-aw/token-audit/top-workflows.json\n\necho \"✅ Generated top workflow summary at /tmp/gh-aw/token-audit/top-workflows.json\"\njq '.top_workflows' /tmp/gh-aw/token-audit/top-workflows.json\n"

Copilot AI Apr 28, 2026


The pre-aggregation jq snippet uses group_by(.workflow_name) without first sorting by workflow_name. In jq, group_by only groups adjacent items, so totals can be incorrect unless the array is sorted by the same key first. Insert | sort_by(.workflow_name) before | group_by(.workflow_name) inside this jq program.

Suggested change
run: "set -euo pipefail\nmkdir -p /tmp/gh-aw/token-audit\n\njq '{\n generated_at: (now | todateiso8601),\n window_days: 7,\n top_workflows: (\n [.runs[]\n | select(.status == \"completed\")\n | {\n workflow_name: .workflow_name,\n tokens: (.token_usage // 0),\n cost: (.estimated_cost // 0),\n turns: (.turns // 0),\n action_minutes: (.action_minutes // 0)\n }\n ]\n | group_by(.workflow_name)\n | map({\n workflow_name: .[0].workflow_name,\n run_count: length,\n total_tokens: (map(.tokens) | add // 0),\n avg_tokens: ((map(.tokens) | add // 0) / length),\n total_cost: (map(.cost) | add // 0),\n total_turns: (map(.turns) | add // 0),\n total_action_minutes: (map(.action_minutes) | add // 0)\n })\n | sort_by(.total_tokens)\n | reverse\n | .[:10]\n )\n}' /tmp/gh-aw/token-audit/all-runs.json > /tmp/gh-aw/token-audit/top-workflows.json\n\necho \"✅ Generated top workflow summary at /tmp/gh-aw/token-audit/top-workflows.json\"\njq '.top_workflows' /tmp/gh-aw/token-audit/top-workflows.json\n"
run: "set -euo pipefail\nmkdir -p /tmp/gh-aw/token-audit\n\njq '{\n generated_at: (now | todateiso8601),\n window_days: 7,\n top_workflows: (\n [.runs[]\n | select(.status == \"completed\")\n | {\n workflow_name: .workflow_name,\n tokens: (.token_usage // 0),\n cost: (.estimated_cost // 0),\n turns: (.turns // 0),\n action_minutes: (.action_minutes // 0)\n }\n ]\n | sort_by(.workflow_name)\n | group_by(.workflow_name)\n | map({\n workflow_name: .[0].workflow_name,\n run_count: length,\n total_tokens: (map(.tokens) | add // 0),\n avg_tokens: ((map(.tokens) | add // 0) / length),\n total_cost: (map(.cost) | add // 0),\n total_turns: (map(.turns) | add // 0),\n total_action_minutes: (map(.action_minutes) | add // 0)\n })\n | sort_by(.total_tokens)\n | reverse\n | .[:10]\n )\n}' /tmp/gh-aw/token-audit/all-runs.json > /tmp/gh-aw/token-audit/top-workflows.json\n\necho \"✅ Generated top workflow summary at /tmp/gh-aw/token-audit/top-workflows.json\"\njq '.top_workflows' /tmp/gh-aw/token-audit/top-workflows.json\n"


## Phase Timing Checkpoints

At the start of every phase, log a timestamp with `echo "⏱️ Phase N start: $(date -u +%H:%M:%SZ)"`. At the end of each phase, log elapsed time with `echo "✅ Phase N complete: $(date -u +%H:%M:%SZ)"`. This helps identify which phases consume the most wall-clock time so future timeouts can be sized accurately.

Copilot AI Apr 28, 2026


This section says to "log elapsed time" at phase end, but the suggested command logs only another timestamp (no duration). Either adjust the wording to “log a timestamp” or change the guidance to actually compute and print elapsed time (e.g., capture a start epoch/SECONDS and subtract at phase end).

Suggested change
At the start of every phase, log a timestamp with `echo "⏱️ Phase N start: $(date -u +%H:%M:%SZ)"`. At the end of each phase, log elapsed time with `echo "✅ Phase N complete: $(date -u +%H:%M:%SZ)"`. This helps identify which phases consume the most wall-clock time so future timeouts can be sized accurately.
At the start of every phase, log a timestamp with `echo "⏱️ Phase N start: $(date -u +%H:%M:%SZ)"`. At the end of each phase, log an end timestamp with `echo "✅ Phase N complete: $(date -u +%H:%M:%SZ)"`. These checkpoints help identify which phases consume the most wall-clock time by comparing start and end times so future timeouts can be sized accurately.

