perf(skills): prune scraped HTML doc references from integration skills#455
Each integration skill (24 variants — Next.js, Vue, Angular, Astro,
SvelteKit, etc.) was shipping three full HTML scrapes of Amplitude's
documentation pages under `references/`:
references/browser-sdk-2.md ~270 KB
references/browser-unified-sdk.md ~110 KB
references/amplitude-quickstart.md ~110 KB
─────────────────────────────────────────────
~490 KB × 24 variants = ~12 MB of HTML noise
The files are full HTML scrapes (HEAD, navigation, footer, JSON-LD,
favicons) of Amplitude's doc pages — the actual reference content the
agent might want is buried under tens of KB of boilerplate.
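The size estimate above can be re-measured with a short sketch like the one below. It totals the bytes of the three reference files across every skill directory under a root. The function name `reference_bytes` and the default root are hypothetical and are not part of the repository's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total size in bytes of the three scraped reference
# files across every integration skill under the given root directory.
reference_bytes() {
  local root="${1:-skills/integration}"
  local total=0 size f name
  for name in browser-sdk-2.md browser-unified-sdk.md amplitude-quickstart.md; do
    for f in "$root"/*/references/"$name"; do
      [ -f "$f" ] || continue        # glob matched nothing, or not a regular file
      size=$(wc -c < "$f")
      total=$((total + size))
    done
  done
  echo "$total"
}
```

Running `reference_bytes skills/integration` before and after the deletion would be one way to confirm the figure; afterwards it should print `0`.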
The integration workflow (`basic-integration-1.1-edit.md`) tells the
agent: "Remember the documentation and example project resources you
were provided at the beginning. Read them now." So the cost lands on
EVERY cold-start run, with no opt-out — the agent dutifully Reads each
massive HTML blob and tokenizes the boilerplate. EXAMPLE.md (clean
Markdown, ~20 KB) already covers the practical init/track patterns.
Changes:
- `scripts/refresh-skills.sh` learns to skip the three reference
files during `unzip` (so a future `pnpm skills:refresh` can't
reintroduce them) and runs a post-extraction `sed` to drop the dangling
list entries from each SKILL.md (so the agent doesn't see
orphaned references and try to Read missing files).
- `skills/integration/*/SKILL.md` × 24: the three reference list
bullets removed (3 lines per file, 72 lines total). Done now so
the repo is already clean before the next refresh; the script keeps
it that way after.
- `skills/integration/*/references/{browser-sdk-2, browser-unified-sdk, amplitude-quickstart}.md`
— 72 files deleted (~12 MB on disk).
Result: ~12 MB of HTML noise out of the npm package and out of the
agent's prompt budget on every integration run. EXAMPLE.md remains
the canonical practical reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
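The post-extraction cleanup described in the commit message could look roughly like the sketch below. It is not the code from `scripts/refresh-skills.sh`: the array `EXCLUDED_REFS` and the function `strip_dangling_refs` are hypothetical names, and the assumption is that each dangling entry sits on a single line containing `references/<file>`.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout; the real script's variables may differ.
EXCLUDED_REFS=(browser-sdk-2.md browser-unified-sdk.md amplitude-quickstart.md)

# Delete every line of a SKILL.md that mentions references/<excluded file>.
strip_dangling_refs() {
  local skill_md="$1" name tmp
  for name in "${EXCLUDED_REFS[@]}"; do
    tmp=$(mktemp)
    # Dots are escaped so they match literally; '|' delimits the address
    # because the pattern itself contains '/'.
    sed "\|references/${name//./\\.}|d" "$skill_md" > "$tmp" \
      && cat "$tmp" > "$skill_md"    # write back in place, keeping file mode
    rm -f "$tmp"
  done
}
```

Writing through a temporary file avoids `sed -i`, whose argument syntax differs between GNU and BSD sed.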
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Reversed unzip `-x` and `-d` argument order
  - Reordered the unzip arguments so `-x` patterns come before `-d dir`, matching the Info-ZIP documented grammar.
Or push these changes by commenting:
@cursor push d4c8384aea
Preview (d4c8384aea)
diff --git a/scripts/refresh-skills.sh b/scripts/refresh-skills.sh
--- a/scripts/refresh-skills.sh
+++ b/scripts/refresh-skills.sh
@@ -83,7 +83,7 @@
fi
mkdir -p "$dest"
- unzip -q -o "$zip" -d "$dest" -x "${SKILL_EXCLUDE_PATTERNS[@]}"
+ unzip -q -o "$zip" -x "${SKILL_EXCLUDE_PATTERNS[@]}" -d "$dest"
# Drop dangling reference list entries in SKILL.md that point at the
# files we just excluded — without this, the agent loads SKILL.md, sees
Reviewed by Cursor Bugbot for commit 3e38e3b. Configure here.
mkdir -p "$dest"
- unzip -q -o "$zip" -d "$dest"
+ unzip -q -o "$zip" -d "$dest" -x "${SKILL_EXCLUDE_PATTERNS[@]}"
Reversed unzip `-x` and `-d` argument order
Medium Severity
The unzip call places `-d "$dest"` before `-x "${SKILL_EXCLUDE_PATTERNS[@]}"`. The Info-ZIP unzip man page states that when both `-x` and `-d` are used, `-d exdir` must follow the `-x xfile(s)` list. With the current ordering, some unzip implementations may not recognize the exclude patterns, silently extracting the files the script intends to strip. The correct order is `-x patterns -d dir`.
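Whichever argument order is used, the silent-extraction failure mode described in this comment can be caught with a check after `unzip` returns. The sketch below is one way to do that; `assert_excluded_absent` is a hypothetical name, and it assumes the excluded files would land directly under `$dest/references/`.

```shell
#!/usr/bin/env bash
# Hypothetical guard: return non-zero if any file that should have been
# excluded is present in the extraction directory.
assert_excluded_absent() {
  local dest="$1" name found=0
  for name in browser-sdk-2.md browser-unified-sdk.md amplitude-quickstart.md; do
    if [ -e "$dest/references/$name" ]; then
      echo "unexpected file after extraction: $dest/references/$name" >&2
      found=1
    fi
  done
  return "$found"
}
```

Called right after extraction (for example `assert_excluded_absent "$dest" || exit 1`), it turns an ignored exclude pattern into a loud failure instead of a quietly larger package.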



Summary
- Each integration skill (24 variants) shipped three full HTML scrapes of Amplitude's doc pages under `references/`: `browser-sdk-2.md` (~270 KB), `browser-unified-sdk.md` (~110 KB), `amplitude-quickstart.md` (~110 KB). That's ~12 MB of HTML noise in the npm package, dominated by HEAD / navigation / footer / JSON-LD boilerplate rather than actual reference content.
- The integration workflow (`basic-integration-1.1-edit.md`) tells the agent "Read them now" after Skill load, so the cost lands on every cold-start run. EXAMPLE.md (~20 KB clean Markdown) already covers the practical init/track patterns.
- This PR deletes the three files and drops their list entries from each `SKILL.md` so the agent doesn't try to Read missing files.

Changes
- `scripts/refresh-skills.sh`
  - Skips the three reference files during `unzip` so a future `pnpm skills:refresh` can't reintroduce the heavy files.
  - A post-extraction `sed` drops the dangling reference bullets from each `SKILL.md`.
- `skills/integration/*/SKILL.md` × 24: three list bullets removed each (72 lines total).
- `skills/integration/*/references/{browser-sdk-2, browser-unified-sdk, amplitude-quickstart}.md`: 72 files deleted (~12 MB on disk).

Impact
- The agent Reads one reference file (`EXAMPLE.md`) instead of 4. Practical wall-time savings on the order of seconds per run.

Follow-up
- The scraped pages come from `amplitude/context-hub` (`transformation-config/skills/integration/config.yaml`, `docs_urls`). The script-level filter here is defense-in-depth so the wizard stays clean even before that change lands. A separate context-hub PR should drop the URLs.

Test plan
- `CONTEXT_HUB_DIST=… pnpm skills:refresh` runs idempotently: re-running produces no diffs.
- `pnpm test src/lib/__tests__/commandments` passes.
- `pnpm lint` clean.

cc @amplitude/growth
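The idempotency item in the test plan can be checked mechanically. The sketch below runs a command twice and compares checksums of a directory between the two runs. Both function names are hypothetical, and in a git checkout `git diff --exit-code` after the second run would be a simpler alternative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checksum every file under a directory in a stable order.
snapshot() {
  (cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort)
}

# Run a command twice; succeed only if the second run changes nothing
# under the given directory. Returns 2 if the command itself fails.
is_idempotent() {
  local dir="$1"; shift
  local before after
  "$@" || return 2
  before=$(snapshot "$dir")
  "$@" || return 2
  after=$(snapshot "$dir")
  [ "$before" = "$after" ]
}
```

A call might look like `is_idempotent skills/integration pnpm skills:refresh`, with `CONTEXT_HUB_DIST` exported beforehand.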
🤖 Generated with Claude Code
Note
Low Risk
Low risk: changes only affect bundled skill assets and the refresh script, but could cause missing-reference issues if any workflow still expects the removed files.
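The missing-reference risk noted here can be probed by searching for any remaining mention of the removed files. The sketch below is one such check. `find_stale_mentions` is a hypothetical name, and the regular expression assumes mentions take the form `references/<file>.md`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files under a root that still mention one of the
# removed reference files. Prints nothing when the tree is clean.
find_stale_mentions() {
  local root="$1"
  grep -rlE 'references/(browser-sdk-2|browser-unified-sdk|amplitude-quickstart)\.md' "$root" || true
}
```

Pointing it at the skills and workflow directories and expecting empty output would confirm that nothing still asks the agent to Read the deleted files.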
Overview
Reduces integration skill payload and agent startup context by removing large scraped HTML “doc” reference files from the repo and preventing them from being reintroduced on refresh.
`scripts/refresh-skills.sh` now excludes `references/browser-sdk-2.md`, `references/browser-unified-sdk.md`, and `references/amplitude-quickstart.md` during ZIP extraction and post-processes each extracted `SKILL.md` to delete any reference-list entries that point to those excluded files, avoiding broken reads.
Reviewed by Cursor Bugbot for commit 3e38e3b. Bugbot is set up for automated code reviews on this repo.