fix(dictation): @copilot everywhere, NLP histogram step, 256-term glossary #28572
Conversation
- Add Python NLP word-frequency histogram as Step 1 in `dictation-prompt.md`
- Renumber old steps: Step 1→2, Step 2→3, Step 3→4
- Change `@copilot`: use `@copilot` in engines list instead of standalone `copilot`
- Update term target: ~1000 → 256 in mission, guidelines, and success criteria
- Trim glossary in `SKILL.md` from ~1000 to exactly 256 high-frequency terms
- Remove standalone `copilot` entry (only `@copilot` remains in glossary)
- Fix speech-to-text rule: "co-pilot" → `@copilot`

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/bdf8e7b5-b0cb-45b9-bee8-0e2bfa8f344d
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR updates the dictation skill guidance and glossary to use @copilot consistently, shrink the glossary to a curated 256 terms, and introduce a data-driven (histogram-based) term selection step for the generator workflow.
Changes:
- Trim the `skills/dictation/SKILL.md` glossary to 256 terms and correct `"co-pilot"` → `@copilot`.
- Add an NLP word-frequency histogram step to `.github/workflows/dictation-prompt.md` to guide term selection.
- Update workflow guidance to target 256 terms and switch the engine mention to `@copilot`.
Summary per file:
| File | Description |
|---|---|
| skills/dictation/SKILL.md | Reduces/curates glossary entries and updates speech-to-text correction to @copilot. |
| .github/workflows/dictation-prompt.md | Adds histogram step and updates mission/guidelines to use @copilot and the 256-term target. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 7
```diff
 Create a concise dictation instruction file at `skills/dictation/SKILL.md` that:
-1. Contains a glossary of approximately 1000 project-specific terms extracted from documentation
+1. Contains a glossary of exactly 256 project-specific terms extracted from documentation
```
The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.
Suggested change:

```diff
-1. Contains a glossary of exactly 256 project-specific terms extracted from documentation
+1. Contains a glossary of 240–270 project-specific terms extracted from documentation
```
```diff
 - Scan only `docs/src/content/docs/**/*.md` files
-- Extract ~1000 terms (950-1050 acceptable)
+- Extract 256 terms (240-270 acceptable)
```
The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.
```diff
 - ✅ File `skills/dictation/SKILL.md` exists
 - ✅ Contains proper SKILL.md frontmatter (name, description)
-- ✅ Contains ~1000 project-specific terms (950-1050 acceptable)
+- ✅ Contains 256 project-specific terms (240-270 acceptable)
```
The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.
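Whichever target wins the consistency fix, it could be enforced mechanically in the success-criteria check. A minimal sketch of such a check; the bullet-with-backticks pattern is an assumption about how SKILL.md lists its glossary entries, not a confirmed detail of the file:

```python
import re

def count_glossary_terms(text):
    # Assumes each glossary entry is a bulleted, backticked term,
    # e.g. "- `gh-aw`"; adjust the pattern to SKILL.md's actual layout.
    return len(re.findall(r'^[-*] +`[^`\n]+`', text, flags=re.MULTILINE))

def meets_target(n, lo=240, hi=270):
    # The 240-270 band from the success criteria; collapse to
    # lo=hi=256 if "exactly 256" is chosen instead.
    return lo <= n <= hi
```

Run against the real file with, for example, `meets_target(count_glossary_terms(Path("skills/dictation/SKILL.md").read_text(encoding="utf-8")))`.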
```diff
-### 1. Scan Documentation for Project-Specific Glossary
+### 1. Run NLP Word-Frequency Histogram
```
> Run the following Python script to compute a word-frequency histogram of code-formatted tokens across all documentation files. Use the output as the **primary source** for selecting the 256 glossary terms — prefer tokens with high frequency that are project-specific (not generic English words).
The text says the histogram is of 'code-formatted tokens', but the script also counts non-backticked words via the second regex (which will pull in many generic English words). Either (mandatory) update the description to reflect that it includes non-code tokens, or (preferred) change the script to only count code-formatted identifiers (e.g., backticks and/or fenced code blocks) to better match the stated intent and reduce noise.
```diff
+# Collect backtick-quoted technical tokens
+tokens.update(re.findall(r'`([^`\n]+)`', text))
+# Also collect hyphenated/dotted/underscored identifiers
+tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
```
The text says the histogram is of 'code-formatted tokens', but the script also counts non-backticked words via the second regex (which will pull in many generic English words). Either (mandatory) update the description to reflect that it includes non-code tokens, or (preferred) change the script to only count code-formatted identifiers (e.g., backticks and/or fenced code blocks) to better match the stated intent and reduce noise.
Suggested change:

````diff
-# Collect backtick-quoted technical tokens
-tokens.update(re.findall(r'`([^`\n]+)`', text))
-# Also collect hyphenated/dotted/underscored identifiers
-tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+# Collect inline code tokens
+tokens.update(re.findall(r'`([^`\n]+)`', text))
+# Collect identifier-like tokens from fenced code blocks
+for block in re.findall(r'```[^\n]*\n(.*?)```', text, flags=re.DOTALL):
+    tokens.update(re.findall(r'\b([A-Za-z_][\w\-\.]{1,}[A-Za-z0-9_])\b', block))
````
```diff
+tokens = Counter()
+
+for md_file in docs.rglob("*.md"):
+    text = md_file.read_text(errors="replace")
```
For deterministic results across environments, it’s better to specify an explicit encoding when reading documentation files (e.g., UTF-8). Relying on the platform default encoding can change tokenization/histogram output between runners.
Suggested change:

```diff
-    text = md_file.read_text(errors="replace")
+    text = md_file.read_text(encoding="utf-8", errors="replace")
```
```diff
+# Collect backtick-quoted technical tokens
+tokens.update(re.findall(r'`([^`\n]+)`', text))
+# Also collect hyphenated/dotted/underscored identifiers
+tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+
+print("Frequency histogram — top 500 project tokens:")
+for tok, n in tokens.most_common(500):
+    if len(tok) > 2:
+        print(f"  {n:5d} {tok}")
```
The len(tok) > 2 filter is redundant given the second regex enforces a minimum token length already, and the backtick regex will include multi-word phrases that may not be desired. Consider making the filtering criteria explicit (e.g., filter out whitespace-containing backtick matches, or consolidate filtering in one place) so the histogram output better matches the stated goal of 'project tokens'.
Suggested change:

```diff
-# Collect backtick-quoted technical tokens
-tokens.update(re.findall(r'`([^`\n]+)`', text))
-# Also collect hyphenated/dotted/underscored identifiers
-tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
-print("Frequency histogram — top 500 project tokens:")
-for tok, n in tokens.most_common(500):
-    if len(tok) > 2:
-        print(f"  {n:5d} {tok}")
+# Collect backtick-quoted technical tokens, excluding multi-word phrases
+tokens.update(
+    tok
+    for tok in re.findall(r'`([^`\n]+)`', text)
+    if len(tok) > 2 and not re.search(r'\s', tok)
+)
+# Also collect hyphenated/dotted/underscored identifiers
+tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+print("Frequency histogram — top 500 project tokens:")
+for tok, n in tokens.most_common(500):
+    print(f"  {n:5d} {tok}")
```
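Taken together, the review's three suggestions (count only code-formatted tokens, read with explicit UTF-8, filter in one place) could be folded into a single revised script. A sketch under those assumptions, not necessarily the final contents of the workflow file:

````python
import re
from collections import Counter
from pathlib import Path

def build_histogram(docs_root="docs/src/content/docs"):
    tokens = Counter()
    for md_file in Path(docs_root).rglob("*.md"):
        # Explicit encoding keeps tokenization deterministic across runners.
        text = md_file.read_text(encoding="utf-8", errors="replace")
        # Inline code tokens, excluding multi-word phrases and short tokens.
        tokens.update(
            tok
            for tok in re.findall(r'`([^`\n]+)`', text)
            if len(tok) > 2 and not re.search(r'\s', tok)
        )
        # Identifier-like tokens from fenced code blocks only, so generic
        # English prose words stay out of the histogram.
        for block in re.findall(r'```[^\n]*\n(.*?)```', text, flags=re.DOTALL):
            tokens.update(
                re.findall(r'\b([A-Za-z_][\w\-\.]{1,}[A-Za-z0-9_])\b', block)
            )
    return tokens

if __name__ == "__main__" and Path("docs/src/content/docs").is_dir():
    tokens = build_histogram()
    print("Frequency histogram — top 500 project tokens:")
    for tok, n in tokens.most_common(500):
        print(f"  {n:5d} {tok}")
````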
The dictation skill and its generator workflow had ~1000 undifferentiated terms, no data-driven term selection, and used bare `copilot` instead of `@copilot`.

`skills/dictation/SKILL.md`:
- Standalone `copilot` entry removed; `@copilot` was already present
- `"co-pilot"` → `@copilot`

`.github/workflows/dictation-prompt.md`:
- `copilot` → `@copilot` in focus-area list
- `256 (240–270 acceptable)` throughout mission, guidelines, and success criteria

Warning
Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

- `https://api.github.com/graphql` via `/usr/bin/gh api graphql -f query=query($owner: String!, $name: String!) { repository(owner: $owner, name: $name) { hasDiscussionsEnabled } } -f owner=github -f name=gh-aw` (http block)
- `https://api.github.com/repos/astral-sh/setup-uv/git/ref/tags/eac588ad8def6316056a12d4907a9d4d84ff7a3b` via `gh api /repos/astral-sh/setup-uv/git/ref/tags/eac588ad8def6316056a12d4907a9d4d84ff7a3b --jq '[.object.sha, .object.type] | @tsv'` (http block)
- `https://api.github.com/repos/github/gh-aw` via `gh api /repos/github/gh-aw --jq .default_branch` (http block)

If you need me to access, download, or install something from one of these locations, you can either: