fix(dictation): @copilot everywhere, NLP histogram step, 256-term glossary #28572
Changes from all commits
@@ -47,15 +47,42 @@ Extract technical vocabulary from documentation files and create a concise dicta
 ## Your Mission
 
 Create a concise dictation instruction file at `skills/dictation/SKILL.md` that:
-1. Contains a glossary of approximately 1000 project-specific terms extracted from documentation
+1. Contains a glossary of exactly 256 project-specific terms extracted from documentation
 2. Provides instructions for fixing speech-to-text errors (ambiguous terms, spacing, hyphenation)
 3. Provides instructions for "agentifying" text: removing filler words (humm, you know, um, uh, like, etc.), improving clarity, and making text more professional
 4. Does NOT include planning guidelines or examples (keep it short and focused on error correction and text cleanup)
 5. Includes guidelines to NOT plan or provide examples, just focus on fixing speech-to-text errors and improving text quality.
 
 ## Task Steps
 
-### 1. Scan Documentation for Project-Specific Glossary
+### 1. Run NLP Word-Frequency Histogram
 
+Run the following Python script to compute a word-frequency histogram of code-formatted tokens across all documentation files. Use the output as the **primary source** for selecting the 256 glossary terms — prefer tokens with high frequency that are project-specific (not generic English words).
+
+```bash
+python3 - <<'EOF'
+import re
+from pathlib import Path
+from collections import Counter
+
+docs = Path("docs/src/content/docs")
+tokens = Counter()
+
+for md_file in docs.rglob("*.md"):
+    text = md_file.read_text(errors="replace")
Suggested change:
-    text = md_file.read_text(errors="replace")
+    text = md_file.read_text(encoding="utf-8", errors="replace")
Copilot AI commented on Apr 26, 2026
The text says the histogram is of 'code-formatted tokens', but the script also counts non-backticked words via the second regex (which will pull in many generic English words). Either (mandatory) update the description to reflect that it includes non-code tokens, or (preferred) change the script to only count code-formatted identifiers (e.g., backticks and/or fenced code blocks) to better match the stated intent and reduce noise.
Suggested change:
-    # Collect backtick-quoted technical tokens
-    tokens.update(re.findall(r'`([^`\n]+)`', text))
-    # Also collect hyphenated/dotted/underscored identifiers
-    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+    # Collect inline code tokens
+    tokens.update(re.findall(r'`([^`\n]+)`', text))
+    # Collect identifier-like tokens from fenced code blocks
+    for block in re.findall(r'```[^\n]*\n(.*?)```', text, flags=re.DOTALL):
+        tokens.update(re.findall(r'\b([A-Za-z_][\w\-\.]{1,}[A-Za-z0-9_])\b', block))
Copilot AI commented on Apr 26, 2026
The len(tok) > 2 filter is redundant given the second regex enforces a minimum token length already, and the backtick regex will include multi-word phrases that may not be desired. Consider making the filtering criteria explicit (e.g., filter out whitespace-containing backtick matches, or consolidate filtering in one place) so the histogram output better matches the stated goal of 'project tokens'.
Suggested change:
-    # Collect backtick-quoted technical tokens
-    tokens.update(re.findall(r'`([^`\n]+)`', text))
-    # Also collect hyphenated/dotted/underscored identifiers
-    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
-print("Frequency histogram — top 500 project tokens:")
-for tok, n in tokens.most_common(500):
-    if len(tok) > 2:
-        print(f" {n:5d} {tok}")
+    # Collect backtick-quoted technical tokens, excluding multi-word phrases
+    tokens.update(
+        tok
+        for tok in re.findall(r'`([^`\n]+)`', text)
+        if len(tok) > 2 and not re.search(r'\s', tok)
+    )
+    # Also collect hyphenated/dotted/underscored identifiers
+    tokens.update(re.findall(r'\b([\w][\w\-\.]{2,}[\w])\b', text))
+print("Frequency histogram — top 500 project tokens:")
+for tok, n in tokens.most_common(500):
+    print(f" {n:5d} {tok}")
Copilot AI commented on Apr 26, 2026
The workflow prompt is internally inconsistent: the mission requires 'exactly 256' terms, but the guidelines/success criteria allow a 240–270 range. Please make these consistent (either enforce exactly 256 everywhere, or change the mission text to reflect the acceptable range) so agents and reviewers have a single clear target.
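If "exactly 256" is the target that wins, it can be made mechanical by ranking the histogram output and slicing at that length. A minimal sketch follows; the stand-in Counter and the is_project_specific filter are hypothetical placeholders, not part of this PR.

```python
from collections import Counter

# Stand-in for the histogram produced by the script above (hypothetical values).
tokens = Counter({"SKILL.md": 41, "dictation": 27, "the": 19, "glossary": 12})

# Hypothetical filter; the real selection criteria belong to the agent prompt.
STOPWORDS = {"the", "and", "with", "from", "this", "that"}

def is_project_specific(token: str) -> bool:
    return token.lower() not in STOPWORDS

# Rank by frequency, keep project-specific tokens, and cut at exactly 256.
glossary = [tok for tok, _ in tokens.most_common() if is_project_specific(tok)][:256]
print(len(glossary), glossary[:3])
```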