
fix(dictation): @copilot everywhere, NLP histogram step, 256-term glossary#28572

Merged
pelikhan merged 2 commits into main from copilot/fix-dictation-prompt-generator
Apr 26, 2026

Conversation

Contributor

Copilot AI commented Apr 26, 2026

The dictation skill and its generator workflow had ~1000 undifferentiated terms, no data-driven term selection, and used bare copilot instead of @copilot.

skills/dictation/SKILL.md

  • Glossary trimmed from ~1000 → exactly 256 high-frequency, domain-curated terms
  • Standalone copilot entry removed; @copilot was already present
  • Speech-to-text rule corrected: "co-pilot" → @copilot

.github/workflows/dictation-prompt.md

  • New Step 1 — precomputed Python NLP word-frequency histogram; agent uses it as the primary source for term selection rather than guessing:
    python3 - <<'EOF'
    import re
    from pathlib import Path
    from collections import Counter
    docs = Path("docs/src/content/docs")
    tokens = Counter()
    for md_file in docs.rglob("*.md"):
        text = md_file.read_text(errors="replace")
        tokens.update(re.findall(r'`([^`\n]+)`', text))
        tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
    for tok, n in tokens.most_common(500):
        if len(tok) > 2:
            print(f"  {n:5d}  {tok}")
    EOF
  • Old steps renumbered (1→2, 2→3, 3→4)
  • Engine reference copilot → @copilot in focus-area list
  • Term target updated to 256 (240–270 acceptable) throughout mission, guidelines, and success criteria

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh /usr/bin/gh api graphql -f query=query($owner: String!, $name: String!) { repository(owner: $owner, name: $name) { hasDiscussionsEnabled } } -f owner=github -f name=gh-aw (http block)
    • Triggering command: /usr/bin/gh /usr/bin/gh api graphql -f query=query($owner: String!, $name: String!) { repository(owner: $owner, name: $name) { hasDiscussionsEnabled } } -f owner=github -f name=gh-aw 7Y47dTp/WwuDpz60-1 (http block)
  • https://api.github.com/repos/astral-sh/setup-uv/git/ref/tags/eac588ad8def6316056a12d4907a9d4d84ff7a3b
    • Triggering command: /usr/bin/gh gh api /repos/astral-sh/setup-uv/git/ref/tags/eac588ad8def6316056a12d4907a9d4d84ff7a3b --jq [.object.sha, .object.type] | @tsv (http block)
  • https://api.github.com/repos/github/gh-aw
    • Triggering command: /usr/bin/gh gh api /repos/github/gh-aw --jq .default_branch (http block)


- Add Python NLP word-frequency histogram as Step 1 in dictation-prompt.md
- Renumber old steps: Step 1→2, Step 2→3, Step 3→4
- Change @copilot: use @copilot in engines list instead of standalone copilot
- Update term target: ~1000 → 256 in mission, guidelines, and success criteria
- Trim glossary in SKILL.md from ~1000 to exactly 256 high-frequency terms
- Remove standalone copilot entry (only @copilot remains in glossary)
- Fix speech-to-text rule: "co-pilot" → @copilot

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/bdf8e7b5-b0cb-45b9-bee8-0e2bfa8f344d

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix dictation prompt generator workflow and glossary" to "fix(dictation): @copilot everywhere, NLP histogram step, 256-term glossary" Apr 26, 2026
Copilot AI requested a review from pelikhan April 26, 2026 11:01
@pelikhan pelikhan marked this pull request as ready for review April 26, 2026 11:01
Copilot AI review requested due to automatic review settings April 26, 2026 11:01
@pelikhan pelikhan merged commit 6df3e16 into main Apr 26, 2026
18 of 19 checks passed
@pelikhan pelikhan deleted the copilot/fix-dictation-prompt-generator branch April 26, 2026 11:01
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates the dictation skill guidance and glossary to use @copilot consistently, shrink the glossary to a curated 256 terms, and introduce a data-driven (histogram-based) term selection step for the generator workflow.

Changes:

  • Trim skills/dictation/SKILL.md glossary to 256 terms and correct "co-pilot" → @copilot.
  • Add an NLP word-frequency histogram step to .github/workflows/dictation-prompt.md to guide term selection.
  • Update workflow guidance to target 256 terms and switch engine mention to @copilot.
Summary per file:

  • skills/dictation/SKILL.md: Reduces/curates glossary entries and updates speech-to-text correction to @copilot.
  • .github/workflows/dictation-prompt.md: Adds histogram step and updates mission/guidelines to use @copilot and the 256-term target.

Copilot's findings


  • Files reviewed: 2/2 changed files
  • Comments generated: 7


  Create a concise dictation instruction file at `skills/dictation/SKILL.md` that:
- 1. Contains a glossary of approximately 1000 project-specific terms extracted from documentation
+ 1. Contains a glossary of exactly 256 project-specific terms extracted from documentation

Copilot AI Apr 26, 2026


The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.

Suggested change
- 1. Contains a glossary of exactly 256 project-specific terms extracted from documentation
+ 1. Contains a glossary of 240–270 project-specific terms extracted from documentation

Copilot uses AI. Check for mistakes.

  - Scan only `docs/src/content/docs/**/*.md` files
- - Extract ~1000 terms (950-1050 acceptable)
+ - Extract 256 terms (240-270 acceptable)

Copilot AI Apr 26, 2026


The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.

  - ✅ File `skills/dictation/SKILL.md` exists
  - ✅ Contains proper SKILL.md frontmatter (name, description)
- - ✅ Contains ~1000 project-specific terms (950-1050 acceptable)
+ - ✅ Contains 256 project-specific terms (240-270 acceptable)

Copilot AI Apr 26, 2026


The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.

- ### 1. Scan Documentation for Project-Specific Glossary
+ ### 1. Run NLP Word-Frequency Histogram

Run the following Python script to compute a word-frequency histogram of code-formatted tokens across all documentation files. Use the output as the **primary source** for selecting the 256 glossary terms — prefer tokens with high frequency that are project-specific (not generic English words).

Copilot AI Apr 26, 2026


The text says the histogram is of 'code-formatted tokens', but the script also counts non-backticked words via the second regex (which will pull in many generic English words). Either (mandatory) update the description to reflect that it includes non-code tokens, or (preferred) change the script to only count code-formatted identifiers (e.g., backticks and/or fenced code blocks) to better match the stated intent and reduce noise.

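The reviewer's point about the second regex can be seen on a single sample line. A minimal sketch (the sentence below is a made-up example, not taken from the docs):

```python
import re

# Hypothetical documentation line mixing prose and inline code
text = "Use the `gh-aw` CLI to compile workflows into lock files."

code_tokens = re.findall(r'`([^`\n]+)`', text)                  # backticked tokens only
broad_tokens = re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text)  # any word of 4+ characters

print(code_tokens)   # only the project-specific token
print(broad_tokens)  # also picks up generic English words such as 'compile' and 'into'
```

The broad pattern matches every plain word of four or more characters, which is why it pulls generic vocabulary into the histogram alongside real identifiers.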
Comment on lines +73 to +76
    # Collect backtick-quoted technical tokens
    tokens.update(re.findall(r'`([^`\n]+)`', text))
    # Also collect hyphenated/dotted/underscored identifiers
    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))

Copilot AI Apr 26, 2026


The text says the histogram is of 'code-formatted tokens', but the script also counts non-backticked words via the second regex (which will pull in many generic English words). Either (mandatory) update the description to reflect that it includes non-code tokens, or (preferred) change the script to only count code-formatted identifiers (e.g., backticks and/or fenced code blocks) to better match the stated intent and reduce noise.

Suggested change
-    # Collect backtick-quoted technical tokens
-    tokens.update(re.findall(r'`([^`\n]+)`', text))
-    # Also collect hyphenated/dotted/underscored identifiers
-    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+    # Collect inline code tokens
+    tokens.update(re.findall(r'`([^`\n]+)`', text))
+    # Collect identifier-like tokens from fenced code blocks
+    for block in re.findall(r'```[^\n]*\n(.*?)```', text, flags=re.DOTALL):
+        tokens.update(re.findall(r'\b([A-Za-z_][\w\-\.]{1,}[A-Za-z0-9_])\b', block))

tokens = Counter()

for md_file in docs.rglob("*.md"):
    text = md_file.read_text(errors="replace")

Copilot AI Apr 26, 2026


For deterministic results across environments, it’s better to specify an explicit encoding when reading documentation files (e.g., UTF-8). Relying on the platform default encoding can change tokenization/histogram output between runners.

Suggested change
-    text = md_file.read_text(errors="replace")
+    text = md_file.read_text(encoding="utf-8", errors="replace")

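The determinism concern can be illustrated by decoding the same bytes under two encodings. A minimal sketch (the token below is a made-up example):

```python
import re

# UTF-8 bytes for a doc line containing a non-ASCII backticked token (hypothetical)
data = "`café-mode` flag".encode("utf-8")

utf8_tokens = re.findall(r'`([^`\n]+)`', data.decode("utf-8"))
# What a runner whose default locale is Latin-1 would effectively see:
latin1_tokens = re.findall(r'`([^`\n]+)`', data.decode("latin-1"))

print(utf8_tokens)    # the intended token
print(latin1_tokens)  # a mojibake variant, so histogram counts diverge between runners
```

Passing `encoding="utf-8"` explicitly removes the dependence on the runner's locale, so the token stream and the resulting histogram are identical everywhere.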
Comment on lines +73 to +81
    # Collect backtick-quoted technical tokens
    tokens.update(re.findall(r'`([^`\n]+)`', text))
    # Also collect hyphenated/dotted/underscored identifiers
    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))

print("Frequency histogram — top 500 project tokens:")
for tok, n in tokens.most_common(500):
    if len(tok) > 2:
        print(f"  {n:5d}  {tok}")

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The len(tok) > 2 filter is redundant given the second regex enforces a minimum token length already, and the backtick regex will include multi-word phrases that may not be desired. Consider making the filtering criteria explicit (e.g., filter out whitespace-containing backtick matches, or consolidate filtering in one place) so the histogram output better matches the stated goal of 'project tokens'.

Suggested change
-    # Collect backtick-quoted technical tokens
-    tokens.update(re.findall(r'`([^`\n]+)`', text))
-    # Also collect hyphenated/dotted/underscored identifiers
-    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
-print("Frequency histogram — top 500 project tokens:")
-for tok, n in tokens.most_common(500):
-    if len(tok) > 2:
-        print(f"  {n:5d}  {tok}")
+    # Collect backtick-quoted technical tokens, excluding multi-word phrases
+    tokens.update(
+        tok
+        for tok in re.findall(r'`([^`\n]+)`', text)
+        if len(tok) > 2 and not re.search(r'\s', tok)
+    )
+    # Also collect hyphenated/dotted/underscored identifiers
+    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+print("Frequency histogram — top 500 project tokens:")
+for tok, n in tokens.most_common(500):
+    print(f"  {n:5d}  {tok}")

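The effect of the suggested whitespace filter on backtick matches, sketched on a made-up line:

```python
import re

# Hypothetical doc line: one multi-word command and one identifier in backticks
text = "Run `gh aw compile` to refresh the `timeout_minutes` field."

raw = re.findall(r'`([^`\n]+)`', text)
# Drop short matches and anything containing whitespace (multi-word phrases)
filtered = [t for t in raw if len(t) > 2 and not re.search(r'\s', t)]

print(raw)       # includes the multi-word phrase 'gh aw compile'
print(filtered)  # only the single-token identifier survives
```

Consolidating the length and whitespace checks at collection time, as the suggestion does, means the final print loop no longer needs its own `len(tok) > 2` filter.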
@github-actions github-actions Bot mentioned this pull request Apr 26, 2026


Development

Successfully merging this pull request may close these issues.

[q] fix(dictation): @copilot, NLP histogram step, 256-term glossary

3 participants