perf(skills): prune scraped HTML doc references from integration skills #455

Open
kelsonpw wants to merge 1 commit into main from kelsonpw/prune-nextjs-skill

kelsonpw (Collaborator) commented Apr 30, 2026

Summary

  • Every integration skill (24 variants) ships three full HTML scrapes of Amplitude doc pages under references/: browser-sdk-2.md (~270 KB), browser-unified-sdk.md (~110 KB), and amplitude-quickstart.md (~110 KB). That's ~12 MB of HTML noise in the npm package, dominated by HEAD / navigation / footer / JSON-LD boilerplate rather than actual reference content.
  • The integration workflow (basic-integration-1.1-edit.md) tells the agent "Read them now" after Skill load, so the cost lands on every cold-start run. EXAMPLE.md (~20 KB clean Markdown) already covers the practical init/track patterns.
  • Strip the three reference files at refresh time and remove their orphaned list entries from each SKILL.md so the agent doesn't try to Read missing files.
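
The ~12 MB figure follows directly from the per-file sizes quoted above. A quick check of the arithmetic, using only the approximate sizes from this description (the variable names are illustrative):

```bash
# Approximate per-file sizes in KB, as quoted in the summary.
per_variant_kb=$((270 + 110 + 110))
variants=24

total_kb=$((per_variant_kb * variants))
echo "${per_variant_kb} KB x ${variants} variants = ${total_kb} KB"
# prints: 490 KB x 24 variants = 11760 KB
```

11760 KB rounds to roughly 12 MB.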

Changes

  • scripts/refresh-skills.sh
    • Adds an exclude list to unzip so a future pnpm skills:refresh can't reintroduce the heavy files.
    • Post-extract sed drops the dangling reference bullets from each SKILL.md.
  • skills/integration/*/SKILL.md × 24: three list bullets removed from each (72 lines total).
  • skills/integration/*/references/{browser-sdk-2, browser-unified-sdk, amplitude-quickstart}.md — 72 files deleted (~12 MB on disk).
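
For readers who don't want to open the diff, the two script changes could look roughly like the sketch below. Only `SKILL_EXCLUDE_PATTERNS` and the `unzip -q -o` call appear in this PR; the glob patterns, the function names `extract_skill` and `prune_skill_md`, and the exact `sed` expression are assumptions for illustration, not the script's actual contents.

```bash
#!/usr/bin/env bash
# Sketch only. Assumes each heavy file sits under a references/ directory
# inside the ZIP; the real archive layout may differ.
SKILL_EXCLUDE_PATTERNS=(
  '*references/browser-sdk-2.md'
  '*references/browser-unified-sdk.md'
  '*references/amplitude-quickstart.md'
)

# Hypothetical helper: extract a skill ZIP into $2, skipping the excluded files.
extract_skill() {
  local zip="$1" dest="$2"
  mkdir -p "$dest"
  unzip -q -o "$zip" -x "${SKILL_EXCLUDE_PATTERNS[@]}" -d "$dest"
}

# Hypothetical helper: delete SKILL.md lines that point at the excluded files.
# Writes through a temp file instead of `sed -i`, whose syntax differs
# between GNU and BSD sed.
prune_skill_md() {
  local skill_md="$1" tmp
  tmp="$(mktemp)"
  sed -E '/references\/(browser-sdk-2|browser-unified-sdk|amplitude-quickstart)\.md/d' \
    "$skill_md" > "$tmp"
  cat "$tmp" > "$skill_md"
  rm -f "$tmp"
}
```

Running `prune_skill_md` on an already-clean SKILL.md leaves it unchanged, which is what keeps the refresh idempotent.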

Impact

  • ~12 MB removed from the npm package (197K lines deleted).
  • Agent's cold-start prompt budget cleaned up: when the workflow says "Read them now", there is now 1 reference file (EXAMPLE.md) to read instead of 4. Practical wall-time savings are on the order of seconds per run.

Follow-up

  • The durable fix lives in amplitude/context-hub (transformation-config/skills/integration/config.yaml docs_urls). The script-level filter here is defense-in-depth so the wizard stays clean even before that change lands. A separate context-hub PR should drop the URLs.

Test plan

  • CONTEXT_HUB_DIST=… pnpm skills:refresh runs idempotently — re-running produces no diffs.
  • pnpm test src/lib/__tests__/commandments passes.
  • pnpm lint clean.
  • EXAMPLE.md still present and untouched in every variant — visual confirmation across Next.js, Vue, Angular, Astro, SvelteKit, React Native, etc.
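
The idempotency item can be checked mechanically rather than by eye. A minimal sketch, assuming a generic helper that runs any command twice and compares a checksum listing of the target directory; `tree_digest` and `check_idempotent` are hypothetical names, not part of this repo:

```bash
# Print a sorted checksum listing of every file under directory $1.
tree_digest() {
  (cd "$1" && find . -type f -exec cksum {} + | sort)
}

# Run the given command twice against directory $1 and succeed only if
# the second run leaves the directory contents byte-for-byte unchanged.
check_idempotent() {
  local dir="$1" before after
  shift
  "$@" || return 1
  before="$(tree_digest "$dir")"
  "$@" || return 1
  after="$(tree_digest "$dir")"
  [ "$before" = "$after" ]
}
```

With `CONTEXT_HUB_DIST` exported as in the test plan, a call such as `check_idempotent skills/integration pnpm skills:refresh` would exit non-zero if the second refresh changed anything.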

cc @amplitude/growth

🤖 Generated with Claude Code


Note

Low Risk
Low risk: changes only affect bundled skill assets and the refresh script, but could cause missing-reference issues if any workflow still expects the removed files.

Overview
Reduces integration skill payload and agent startup context by removing large scraped HTML “doc” reference files from the repo and preventing them from being reintroduced on refresh.

scripts/refresh-skills.sh now excludes references/browser-sdk-2.md, references/browser-unified-sdk.md, and references/amplitude-quickstart.md during ZIP extraction and post-processes each extracted SKILL.md to delete any reference-list entries that point to those excluded files, avoiding broken reads.

Reviewed by Cursor Bugbot for commit 3e38e3b. Bugbot is set up for automated code reviews on this repo.

Each integration skill (24 variants — Next.js, Vue, Angular, Astro,
SvelteKit, etc.) was shipping three full HTML scrapes of Amplitude's
documentation pages under `references/`:

  references/browser-sdk-2.md         ~270 KB
  references/browser-unified-sdk.md   ~110 KB
  references/amplitude-quickstart.md  ~110 KB
  ─────────────────────────────────────────────
  ~490 KB × 24 variants  =  ~12 MB of HTML noise

The files are full HTML scrapes (HEAD, navigation, footer, JSON-LD,
favicons) of Amplitude's doc pages — the actual reference content the
agent might want is buried under tens of KB of boilerplate.

The integration workflow (`basic-integration-1.1-edit.md`) tells the
agent: "Remember the documentation and example project resources you
were provided at the beginning. Read them now." So the cost lands on
EVERY cold-start run, with no opt-out — the agent dutifully Reads each
massive HTML blob and tokenizes the boilerplate. EXAMPLE.md (clean
Markdown, ~20 KB) already covers the practical init/track patterns.

Changes:
  - `scripts/refresh-skills.sh` learns to skip the three reference
    files during `unzip` (so a future `pnpm skills:refresh` can't
    reintroduce them) and runs a post-extract `sed` to drop the dangling
    list entries from each SKILL.md (so the agent doesn't see
    orphaned references and try to Read missing files).
  - `skills/integration/*/SKILL.md` × 24: the three reference list
    bullets removed (3 lines per file, 72 lines total). Done now so
    the repo is already clean before the next refresh; the script
    keeps it that way afterwards.
  - `skills/integration/*/references/{browser-sdk-2, browser-unified-sdk, amplitude-quickstart}.md`
    — 72 files deleted (~12 MB on disk).

Result: ~12 MB of HTML noise out of the npm package and out of the
agent's prompt budget on every integration run. EXAMPLE.md remains
the canonical practical reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions (Contributor)

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

  • /wizard-ci all

Test all apps in a directory:

  • /wizard-ci django
  • /wizard-ci fastapi
  • /wizard-ci flask
  • /wizard-ci javascript-node
  • /wizard-ci javascript-web
  • /wizard-ci next-js
  • /wizard-ci python
  • /wizard-ci react-router
  • /wizard-ci vue

Test an individual app:

  • /wizard-ci django/django3-saas
  • /wizard-ci fastapi/fastapi3-ai-saas
  • /wizard-ci flask/flask3-social-media
  • /wizard-ci javascript-node/express-todo
  • /wizard-ci javascript-node/fastify-blog
  • /wizard-ci javascript-node/hono-links
  • /wizard-ci javascript-node/koa-notes
  • /wizard-ci javascript-node/native-http-contacts
  • /wizard-ci javascript-web/saas-dashboard
  • /wizard-ci next-js/15-app-router-saas
  • /wizard-ci next-js/15-app-router-todo
  • /wizard-ci next-js/15-pages-router-saas
  • /wizard-ci next-js/15-pages-router-todo
  • /wizard-ci python/meeting-summarizer
  • /wizard-ci react-router/react-router-v7-project
  • /wizard-ci react-router/rrv7-starter
  • /wizard-ci react-router/saas-template
  • /wizard-ci react-router/shopper
  • /wizard-ci vue/movies

Results will be posted here when complete.

cursor Bot (Contributor) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Reversed unzip -x and -d argument order
    • Reordered the unzip arguments so -x patterns come before -d dir, matching the Info-ZIP documented grammar.

To push these changes, comment:

@cursor push d4c8384aea
Preview (d4c8384aea)
diff --git a/scripts/refresh-skills.sh b/scripts/refresh-skills.sh
--- a/scripts/refresh-skills.sh
+++ b/scripts/refresh-skills.sh
@@ -83,7 +83,7 @@
   fi
 
   mkdir -p "$dest"
-  unzip -q -o "$zip" -d "$dest" -x "${SKILL_EXCLUDE_PATTERNS[@]}"
+  unzip -q -o "$zip" -x "${SKILL_EXCLUDE_PATTERNS[@]}" -d "$dest"
 
   # Drop dangling reference list entries in SKILL.md that point at the
   # files we just excluded — without this, the agent loads SKILL.md, sees

Reviewed by Cursor Bugbot for commit 3e38e3b.

Comment thread scripts/refresh-skills.sh

   mkdir -p "$dest"
-  unzip -q -o "$zip" -d "$dest"
+  unzip -q -o "$zip" -d "$dest" -x "${SKILL_EXCLUDE_PATTERNS[@]}"

Reversed unzip -x and -d argument order

Medium Severity

The unzip call places -d "$dest" before -x "${SKILL_EXCLUDE_PATTERNS[@]}". The Info-ZIP unzip synopsis lists -d exdir after the -x xfile(s) list. Info-ZIP's man page notes that -d is also accepted immediately after the zipfile specification, so the current ordering works there, but other unzip implementations may not recognize the exclude patterns in that position and would silently extract the files the script intends to strip. The order -x patterns -d dir matches the documented synopsis and is the more portable choice.
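
Because a misparsed exclude list would fail silently, a post-extract guard is one way to surface the problem regardless of which unzip implementation is installed. The sketch below is an assumption about how such a check could look; `assert_excluded` is a hypothetical name and is not part of this PR:

```bash
# Fail if any of the three excluded reference files exists under $1.
assert_excluded() {
  local dest="$1" leaked
  leaked="$(find "$dest" -type f \( \
      -name 'browser-sdk-2.md' -o \
      -name 'browser-unified-sdk.md' -o \
      -name 'amplitude-quickstart.md' \) -path '*/references/*')"
  if [ -n "$leaked" ]; then
    printf 'excluded files were extracted:\n%s\n' "$leaked" >&2
    return 1
  fi
}
```

Calling `assert_excluded "$dest"` right after the unzip step would turn a silent regression into a failed refresh.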


