Merged
45 changes: 36 additions & 9 deletions .github/workflows/dictation-prompt.md
@@ -47,15 +47,42 @@ Extract technical vocabulary from documentation files and create a concise dicta
## Your Mission

Create a concise dictation instruction file at `skills/dictation/SKILL.md` that:
1. Contains a glossary of approximately 1000 project-specific terms extracted from documentation
1. Contains a glossary of exactly 256 project-specific terms extracted from documentation
Copilot AI Apr 26, 2026

The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.

Suggested change
-1. Contains a glossary of exactly 256 project-specific terms extracted from documentation
+1. Contains a glossary of 240–270 project-specific terms extracted from documentation

Copilot uses AI. Check for mistakes.
2. Provides instructions for fixing speech-to-text errors (ambiguous terms, spacing, hyphenation)
3. Provides instructions for "agentifying" text: removing filler words (humm, you know, um, uh, like, etc.), improving clarity, and making text more professional
4. Does NOT include planning guidelines or examples (keep it short and focused on error correction and text cleanup)
5. Includes guidelines to NOT plan or provide examples, just focus on fixing speech-to-text errors and improving text quality.

## Task Steps

### 1. Scan Documentation for Project-Specific Glossary
### 1. Run NLP Word-Frequency Histogram

Run the following Python script to compute a word-frequency histogram of code-formatted tokens across all documentation files. Use the output as the **primary source** for selecting the 256 glossary terms — prefer tokens with high frequency that are project-specific (not generic English words).
Copilot AI Apr 26, 2026

The text says the histogram is of 'code-formatted tokens', but the script also counts non-backticked words via the second regex (which will pull in many generic English words). Either (mandatory) update the description to reflect that it includes non-code tokens, or (preferred) change the script to only count code-formatted identifiers (e.g., backticks and/or fenced code blocks) to better match the stated intent and reduce noise.

```bash
python3 - <<'EOF'
import re
from pathlib import Path
from collections import Counter

docs = Path("docs/src/content/docs")
tokens = Counter()

for md_file in docs.rglob("*.md"):
    text = md_file.read_text(errors="replace")
Copilot AI Apr 26, 2026

For deterministic results across environments, it’s better to specify an explicit encoding when reading documentation files (e.g., UTF-8). Relying on the platform default encoding can change tokenization/histogram output between runners.

Suggested change
-    text = md_file.read_text(errors="replace")
+    text = md_file.read_text(encoding="utf-8", errors="replace")

    # Collect backtick-quoted technical tokens
    tokens.update(re.findall(r'`([^`\n]+)`', text))
    # Also collect hyphenated/dotted/underscored identifiers
    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
Comment on lines +73 to +76
Copilot AI Apr 26, 2026

The text says the histogram is of 'code-formatted tokens', but the script also counts non-backticked words via the second regex (which will pull in many generic English words). Either (mandatory) update the description to reflect that it includes non-code tokens, or (preferred) change the script to only count code-formatted identifiers (e.g., backticks and/or fenced code blocks) to better match the stated intent and reduce noise.

Suggested change
-    # Collect backtick-quoted technical tokens
-    tokens.update(re.findall(r'`([^`\n]+)`', text))
-    # Also collect hyphenated/dotted/underscored identifiers
-    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+    # Collect inline code tokens
+    tokens.update(re.findall(r'`([^`\n]+)`', text))
+    # Collect identifier-like tokens from fenced code blocks
+    for block in re.findall(r'```[^\n]*\n(.*?)```', text, flags=re.DOTALL):
+        tokens.update(re.findall(r'\b([A-Za-z_][\w\-\.]{1,}[A-Za-z0-9_])\b', block))

print("Frequency histogram — top 500 project tokens:")
for tok, n in tokens.most_common(500):
    if len(tok) > 2:
        print(f" {n:5d} {tok}")
Comment on lines +73 to +81
Copilot AI Apr 26, 2026

The len(tok) > 2 filter is redundant given the second regex enforces a minimum token length already, and the backtick regex will include multi-word phrases that may not be desired. Consider making the filtering criteria explicit (e.g., filter out whitespace-containing backtick matches, or consolidate filtering in one place) so the histogram output better matches the stated goal of 'project tokens'.

Suggested change
-    # Collect backtick-quoted technical tokens
-    tokens.update(re.findall(r'`([^`\n]+)`', text))
-    # Also collect hyphenated/dotted/underscored identifiers
-    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
-print("Frequency histogram — top 500 project tokens:")
-for tok, n in tokens.most_common(500):
-    if len(tok) > 2:
-        print(f" {n:5d} {tok}")
+    # Collect backtick-quoted technical tokens, excluding multi-word phrases
+    tokens.update(
+        tok
+        for tok in re.findall(r'`([^`\n]+)`', text)
+        if len(tok) > 2 and not re.search(r'\s', tok)
+    )
+    # Also collect hyphenated/dotted/underscored identifiers
+    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+print("Frequency histogram — top 500 project tokens:")
+for tok, n in tokens.most_common(500):
+    print(f" {n:5d} {tok}")

EOF
```
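
The review suggestions on this script (explicit UTF-8 decoding, counting only code-formatted tokens, and dropping multi-word backtick matches) could be combined roughly as follows. This is a minimal sketch, not the merged script: `count_code_tokens` and `histogram` are hypothetical names, and the fenced-block and identifier patterns are the ones proposed in the review comment.

```python
import re
from collections import Counter
from pathlib import Path

INLINE_CODE = re.compile(r'`([^`\n]+)`')
FENCED_BLOCK = re.compile(r'```[^\n]*\n(.*?)```', re.DOTALL)
IDENTIFIER = re.compile(r'\b([A-Za-z_][\w\-\.]{1,}[A-Za-z0-9_])\b')


def count_code_tokens(text: str) -> Counter:
    """Count tokens found in inline code or fenced code blocks only (hypothetical helper)."""
    tokens = Counter()
    # Pull fenced blocks out first so their backticks do not confuse inline matching.
    blocks = FENCED_BLOCK.findall(text)
    prose = FENCED_BLOCK.sub("", text)
    # Inline code: keep single tokens longer than two characters, skip multi-word phrases.
    tokens.update(
        tok
        for tok in INLINE_CODE.findall(prose)
        if len(tok) > 2 and not re.search(r'\s', tok)
    )
    # Fenced blocks: keep identifier-like tokens (the pattern already requires 3+ characters).
    for block in blocks:
        tokens.update(IDENTIFIER.findall(block))
    return tokens


def histogram(docs_root: Path, top: int = 500) -> list[tuple[str, int]]:
    """Aggregate counts over all Markdown files under docs_root (hypothetical helper)."""
    tokens = Counter()
    # Sorted traversal and explicit UTF-8 keep the output stable across runners.
    for md_file in sorted(docs_root.rglob("*.md")):
        tokens.update(count_code_tokens(md_file.read_text(encoding="utf-8", errors="replace")))
    return tokens.most_common(top)
```

Calling `histogram(Path("docs/src/content/docs"))` would then return the `(token, count)` pairs to print. Plain prose words never enter the counter in this variant, which is the trade-off the reviewer describes as preferred.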

### 2. Scan Documentation for Project-Specific Glossary

Use `search` to efficiently discover documentation covering different areas of the project, then read the returned files to extract vocabulary. This is more targeted than scanning all files with `find`:

@@ -68,7 +95,7 @@ Read each returned file path for its content, then also scan any remaining docum

**Focus areas for extraction:**
- Configuration: safe-outputs, permissions, tools, cache-memory, toolset, frontmatter
- Engines: copilot, claude, codex, custom
- Engines: @copilot, claude, codex, custom
- Bot mentions: @copilot (for GitHub issue assignment)
- Commands: compile, audit, logs, mcp, recompile
- GitHub concepts: workflow_dispatch, pull_request, issues, discussions
@@ -79,13 +106,13 @@ Read each returned file path for its content, then also scan any remaining docum

**Exclude**: makefile, Astro, starlight (tooling-specific, not user-facing)
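
One way to turn a frequency histogram into the alphabetized glossary while honoring the exclusion list is sketched below. `select_glossary` is a hypothetical helper, not part of the workflow. The excluded terms come from the line above, the default limit of 256 comes from the prompt, and the case-insensitive matching is an assumption.

```python
from collections import Counter

# Tooling-specific terms the prompt asks to leave out (compared case-insensitively).
EXCLUDED = {"makefile", "astro", "starlight"}


def select_glossary(tokens: Counter, limit: int = 256) -> list[str]:
    """Pick the most frequent non-excluded tokens and return them alphabetized."""
    kept = [tok for tok, _ in tokens.most_common() if tok.lower() not in EXCLUDED]
    return sorted(kept[:limit], key=str.lower)
```

The frequency cut is applied before sorting, so the alphabetical order only affects presentation, not which terms are chosen.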

### 2. Create the Dictation Instructions File
### 3. Create the Dictation Instructions File

Create `skills/dictation/SKILL.md` with:
- Frontmatter with name and description fields
- Title: Dictation Instructions
- Technical Context: Brief description of gh-aw
- Project Glossary: ~1000 terms, alphabetically sorted, one per line
- Project Glossary: 256 terms, alphabetically sorted, one per line
- Fix Speech-to-Text Errors: Common misrecognitions → correct terms
- Clean Up and Improve Text: Instructions for removing filler words and improving clarity
- Guidelines: General instructions as follows
@@ -100,7 +127,7 @@ You do not have enough background information to plan or provide code examples.
- maintain the user's intended meaning
```

### 3. Create Pull Request
### 4. Create Pull Request

Use the create-pull-request tool to submit your changes with:
- Title: "[docs] Update dictation skill instructions"
@@ -109,9 +136,9 @@ Use the create-pull-request tool to submit your changes with:
## Guidelines

- Scan only `docs/src/content/docs/**/*.md` files
- Extract ~1000 terms (950-1050 acceptable)
- Extract 256 terms (240-270 acceptable)
Copilot AI Apr 26, 2026

The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.

- Exclude tooling-specific terms (makefile, Astro, starlight)
- Prioritize frequently used project-specific terms
- Prioritize frequently used project-specific terms (use NLP histogram from Step 1)
- Alphabetize the glossary
- No descriptions in glossary (just term names)
- Focus on fixing speech-to-text errors, not planning or examples
@@ -120,7 +147,7 @@ Use the create-pull-request tool to submit your changes with:

- ✅ File `skills/dictation/SKILL.md` exists
- ✅ Contains proper SKILL.md frontmatter (name, description)
- ✅ Contains ~1000 project-specific terms (950-1050 acceptable)
- ✅ Contains 256 project-specific terms (240-270 acceptable)
Copilot AI Apr 26, 2026

The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.

- ✅ Terms extracted from documentation only
- ✅ Focuses on fixing speech-to-text errors
- ✅ Includes instructions for removing filler words and improving text clarity
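
Several of these criteria can be checked mechanically, which also gives the 240-270 range a single enforceable definition. The sketch below is illustrative only: `check_skill` is a hypothetical helper, and it assumes the generated file uses `---`-delimited frontmatter and lists the glossary one term per line under a `## Project Glossary` heading, neither of which is confirmed by the diff.

```python
import re


def check_skill(text: str, low: int = 240, high: int = 270) -> list[str]:
    """Return a list of problems found in a SKILL.md body (empty list means it passes)."""
    problems = []

    # Frontmatter: assumed to be a '---' delimited block at the top of the file.
    fm = re.match(r'---\n(.*?)\n---\n', text, flags=re.DOTALL)
    fields = set(re.findall(r'^(\w+):', fm.group(1), flags=re.MULTILINE)) if fm else set()
    for field in ("name", "description"):
        if field not in fields:
            problems.append(f"missing frontmatter field: {field}")

    # Glossary: assumed to run from its heading to the next '## ' heading or end of file.
    section = re.search(
        r'^## Project Glossary\n(.*?)(?=^## |\Z)',
        text,
        flags=re.DOTALL | re.MULTILINE,
    )
    terms = [ln.strip() for ln in section.group(1).splitlines() if ln.strip()] if section else []
    if not low <= len(terms) <= high:
        problems.append(f"glossary has {len(terms)} terms, expected {low}-{high}")
    if terms != sorted(terms, key=str.lower):
        problems.append("glossary is not alphabetized")
    return problems
```

Passing `low=256, high=256` would enforce the "exactly 256" reading instead, so the same check covers either resolution of the inconsistency raised in review.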