
feat: anthropic asset crawler — fetch + categorize + upload to s3#7

Open
abdout wants to merge 34 commits into main from feat/crawl-anthropic-assets

abdout commented Apr 25, 2026

Summary

One-button pnpm crawl:anthropic re-crawls Anthropic web properties and mirrors all new assets to our existing S3 + CloudFront. Replaces the manual WebFetch loop. Idempotent — safe to run weekly.

What changed

  • scripts/crawl-anthropic/ — 11 TS modules + setup-cname.sh
    • index.ts orchestrator with --dry-run, --upload-only, --skip-build
    • hybrid fetcher.ts: undici for static, Playwright for dynamic pages
    • extract.ts mines img / picture / video / link / meta / inline-css / Lottie URLs
    • categorize.ts 18-rule decision tree across 16 categories
    • download.ts streams + sha256 + dimensions (sharp for raster, viewBox for SVG, Lottie w/h for JSON)
    • upload.ts HeadObject precheck + PutObject with Cache-Control: immutable and Metadata: { sourceUrl, sha256 }
    • diff.ts reads existing assets[] to skip already-mirrored URLs
    • emit.ts splices new rows into data.ts and bumps LAST_CRAWLED
  • package.json — devDeps + 3 scripts (crawl:anthropic, crawl:anthropic:dry, crawl:anthropic:upload-only)
  • .gitignore — exclude crawler state dir
  • .claude/commands/crawl-anthropic.md — rewritten to point at the new script
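The "viewBox for SVG" step in download.ts could look roughly like the sketch below. This is an illustration only: the real module is not shown in this PR, and the names `svgDimensions` and `Dimensions` are hypothetical.

```typescript
// Hypothetical helper; the actual download.ts implementation is not shown here.
interface Dimensions {
  width: number;
  height: number;
}

// Reads width/height from an SVG's viewBox attribute ("minX minY width height").
// Returns null when the attribute is missing or malformed.
function svgDimensions(svg: string): Dimensions | null {
  const match = /viewBox\s*=\s*["']([^"']+)["']/i.exec(svg);
  if (!match) return null;
  const raw = match[1] ?? "";
  // Values may be separated by whitespace and/or commas.
  const parts = raw.trim().split(/[\s,]+/).map(Number);
  if (parts.length !== 4 || parts.some((n) => !Number.isFinite(n))) return null;
  const width = parts[2] ?? 0;
  const height = parts[3] ?? 0;
  return width > 0 && height > 0 ? { width, height } : null;
}
```

An SVG without a viewBox yields null, so a caller would need a fallback (for example the width/height attributes) in that case.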

Coverage (per user selection)

  • anthropic.com — full marketing surface (47 existing pages + new: pricing, customers, enterprise, trust, support, news pagination, research papers)
  • claude.ai + claude.com — public marketing (auth-walled app routes skipped)
  • docs.claude.com — Claude Code, Agent SDK, API, MCP, models, prompt-engineering, tool-use, prompt-caching, release-notes
  • support.anthropic.com — articles, depth 1
  • github.com/anthropics — public repos and social previews

Dry-run results

  • 122 / 135 pages crawled (13 page failures, mostly Playwright timeouts — fixed in the same commit by switching from networkidle to domcontentloaded + 2s settle)
  • 536 unique asset candidates discovered
  • 468 new (after dedup against the existing 169)
  • 456 successfully downloaded; 12 dead links flagged in report
  • 94 flagged --needs-review (low-confidence categorization)
  • Distribution: 154 partners, 104 illustrations, 102 brand, 30 fonts, 26 engineering, 13 social, 12 documents, 6 animations, plus values/benchmarks/events
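The timeout fix mentioned in the dry-run results (domcontentloaded plus a 2s settle instead of networkidle) can be sketched as follows. `PageLike` is a hypothetical structural type that mirrors only the two Playwright `Page` methods used, so the sketch runs without Playwright installed; `loadDynamicPage` and the 45s default timeout are assumptions, not the code in fetcher.ts.

```typescript
// Hypothetical minimal shape of a Playwright Page: only the two methods used below.
interface PageLike {
  goto(
    url: string,
    options: { waitUntil: "domcontentloaded" | "networkidle"; timeout: number },
  ): Promise<unknown>;
  waitForTimeout(ms: number): Promise<void>;
}

// Navigate, wait for the DOM to be parsed, then pause briefly so late
// client-side rendering can finish. Avoids waiting for the network to go idle,
// which may never happen on pages with long-lived connections.
async function loadDynamicPage(
  page: PageLike,
  url: string,
  settleMs = 2000, // the "2s settle" from the dry-run notes
  timeoutMs = 45_000, // assumed default for this sketch
): Promise<void> {
  await page.goto(url, { waitUntil: "domcontentloaded", timeout: timeoutMs });
  await page.waitForTimeout(settleMs);
}
```

A real Playwright `Page` satisfies `PageLike` structurally, so it could be passed in directly.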

Slugs are derived from <img alt> so they're human-readable, e.g.
anthropic/illustrations/an-update-on-our-election-safeguards.svg
anthropic/illustrations/introducing-claude-design-by-anthropic-labs.svg
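
A minimal sketch of how an alt text could map to keys like the two above. The function names and the exact normalization rules are assumptions; the PR only states that slugs are derived from `<img alt>`.

```typescript
// Hypothetical slug helper: lowercases, strips diacritics, and collapses every
// run of non-alphanumeric characters into a single hyphen.
function slugFromAlt(alt: string): string {
  return alt
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // drop combining accents left by NFKD
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, ""); // trim leading/trailing hyphens
}

// Builds an S3-style key in the anthropic/<category>/<slug>.<ext> layout.
function assetKey(category: string, alt: string, ext: string): string {
  return `anthropic/${category}/${slugFromAlt(alt)}.${ext}`;
}

console.log(assetKey("illustrations", "An update on our election safeguards", "svg"));
// prints "anthropic/illustrations/an-update-on-our-election-safeguards.svg"
```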

Idempotency

  • Existing 169 assets in data.ts are matched by sourceUrl and skipped — never re-uploaded
  • state/manifest.json is the resume log: maps sourceUrl → { sha256, key }. Crash-safe
  • Before each upload, HeadObject checks the matching sha256 metadata; skip if already there
  • Versioned suffix on hash drift (no CloudFront invalidation needed)
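
The precheck and hash-drift rules above can be expressed as a small pure decision function. This is a sketch under assumptions: `planUpload`, `versionedKey`, and the 8-character hash suffix are hypothetical, and `remote` stands in for whatever sha256 metadata a HeadObject call returns (null when the object does not exist).

```typescript
type Plan = { action: "skip" | "upload"; key: string };

// Hypothetical suffix format: insert "-<first 8 hex chars>" before the extension.
function versionedKey(key: string, sha256: string): string {
  const suffix = `-${sha256.slice(0, 8)}`;
  const dot = key.lastIndexOf(".");
  const hasExt = dot > key.lastIndexOf("/");
  return hasExt ? key.slice(0, dot) + suffix + key.slice(dot) : key + suffix;
}

// remote: sha256 metadata of the object currently stored at `key`, or null if absent.
function planUpload(key: string, localSha256: string, remote: { sha256?: string } | null): Plan {
  if (remote === null) return { action: "upload", key }; // nothing stored yet
  if (remote.sha256 === localSha256) return { action: "skip", key }; // identical bytes
  // Hash drift: the upstream file changed. Writing to a new key leaves the old
  // immutable object untouched, so no CDN invalidation is required.
  return { action: "upload", key: versionedKey(key, localSha256) };
}
```

Because objects are uploaded with an immutable Cache-Control header, writing changed content to a fresh key rather than overwriting is what makes invalidation unnecessary.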

CNAME setup (user action required)

scripts/crawl-anthropic/setup-cname.sh documents the 4 manual AWS steps to add cdn.databayt.org as an alias to the existing CloudFront distribution. Adding the alias is purely additive: the original d1dlwtcfl0db67.cloudfront.net URL keeps working forever, so all hardcoded refs in hogwarts/ (next.config.ts allowed-hostname, WA_CDN_BASE, WA_CHAT_BG, NEXT_PUBLIC_CDN_DOMAIN, cdn-manifest.json) stay healthy.

| Step | Where | What |
|------|-------|------|
| 1 | Route 53 / DNS | CNAME cdn.databayt.org → d1dlwtcfl0db67.cloudfront.net |
| 2 | ACM (us-east-1) | Request public cert for cdn.databayt.org, validate via DNS |
| 3 | CloudFront | Add cdn.databayt.org as alternate domain name + attach cert |
| 4 | Terminal | curl -sI https://cdn.databayt.org/anthropic/brand/claude-wordmark.svg returns 200 |

After step 4 passes, a one-line follow-up PR flips CDN_BASE in data.ts to https://cdn.databayt.org. Hogwarts continues using the old URL — both resolve to the same distribution.

AWS credentials

The dry run did not need AWS credentials. The full run reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (or AWS_PROFILE) from env. Without credentials the script downloads + categorizes locally but skips the upload phase and does NOT mutate data.ts (so the page never points at non-existent S3 keys).
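
The credential gate described above might be sketched like this. The phase names and both function names are hypothetical; only the environment variable names come from the description.

```typescript
type Env = Record<string, string | undefined>;

// True when either a key pair or a named profile is available.
function hasAwsCredentials(env: Env): boolean {
  const keyPair = Boolean(env.AWS_ACCESS_KEY_ID && env.AWS_SECRET_ACCESS_KEY);
  return keyPair || Boolean(env.AWS_PROFILE);
}

// Without credentials only the local phases run. Upload and the data.ts
// rewrite are skipped together so the catalog never references missing S3 keys.
function phasesToRun(env: Env): string[] {
  const local = ["download", "categorize"];
  return hasAwsCredentials(env) ? [...local, "upload", "emit"] : local;
}
```

Tying the data.ts rewrite to the upload phase, rather than to download success, is the design choice that keeps the page from pointing at objects that were never uploaded.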

Test plan

  • npx tsc --noEmit — clean
  • pnpm build — clean
  • pnpm crawl:anthropic:dry — 122/135 pages, report.md generated, no fatal errors
  • pnpm crawl:anthropic — run after AWS creds + CNAME setup
  • Open localhost:3000/en/anthropic post-run — spot-check 3 new assets render

Trademark

Anthropic's trademark guidelines restrict redistribution of brand assets. The /anthropic page is an internal showcase only.

🤖 Generated with Claude Code

One-button `pnpm crawl:anthropic` re-crawls all Anthropic web surfaces
(anthropic.com, claude.ai, claude.com, docs.claude.com, support.anthropic.com,
github.com/anthropics), extracts every img/picture/video/icon/Lottie/font/PDF
URL, downloads new ones, uploads to s3://hogwarts-databayt/anthropic/<cat>/,
appends rows to src/components/root/anthropic/data.ts, and bumps LAST_CRAWLED.

- Hybrid fetch: undici for static pages, Playwright for dynamic (/pricing,
  /claude/*, /product/*, /features/*, all docs)
- Idempotent: skip on sourceUrl match in data.ts, sha256 manifest as resume log,
  HeadObject precheck before PutObject; no CloudFront invalidation needed
- Decision-tree categorization across 16 categories with deterministic ranking
- Slugs derived from img alt → caption → page URL → sha256 prefix; collisions
  resolved with 4-char hash suffix
- Honors robots.txt, custom UA, 2 req/s per host, exponential backoff on 429/503
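
As an illustration of the backoff bullet above, a minimal sketch: the base delay, cap, and attempt limit are assumed values, and `backoffMs` / `retryDelay` are hypothetical names rather than the crawler's actual API.

```typescript
// Status codes the commit message says trigger a retry.
const RETRYABLE = new Set<number>([429, 503]);

// Delay doubles with each attempt (0-based) and is clamped to capMs.
// The 500 ms base matches the 2 req/s per-host spacing.
function backoffMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Returns how long to wait before retrying, or null to give up.
function retryDelay(status: number, attempt: number, maxAttempts = 5): number | null {
  if (!RETRYABLE.has(status) || attempt >= maxAttempts) return null;
  return backoffMs(attempt);
}
```

A production version would likely add jitter and honor a Retry-After header when the server sends one; both are omitted here for brevity.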

Dry run: 122/135 pages crawled, 468 new candidates after dedup against existing
169 assets, 456 successfully downloaded, categorization 80%+ confident.

setup-cname.sh documents the additive cdn.databayt.org alias setup —
existing d1dlwtcfl0db67.cloudfront.net keeps working unchanged for hogwarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented Apr 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---------|------------|---------|---------------|
| kun | Ready | Preview, Comment | Apr 28, 2026 8:34am |

cdn.databayt.org is already in use by another Vercel project (returns 307
redirect to /ar with a NEXT_LOCALE cookie), so we'd break it by reassigning.
assets.databayt.org is free and more semantic for static-asset CDN.

DNS already provisioned on Vercel via vercel CLI:
  vercel dns add databayt.org assets CNAME d1dlwtcfl0db67.cloudfront.net

Also added CAA record allowing amazon.com (existing CAA whitelist only had
pki.goog/sectigo.com/letsencrypt.org — would have failed ACM cert validation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Freed up cdn.databayt.org from the wildcard *.databayt.org → hogwarts
routing by adding a specific CNAME via Vercel CLI:

  vercel dns add databayt.org cdn CNAME d1dlwtcfl0db67.cloudfront.net

Specific record overrides wildcard, so cdn.databayt.org now resolves to
CloudFront IPs (52.85.32.x). Hogwarts continues serving everything else
under *.databayt.org untouched.

Spare CNAME at assets.databayt.org left in place — additive, harmless,
and a convenient backup alias.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Node 20.6+ supports --env-file natively (no dotenv dep needed). Without
this, the crawler can't read AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
from .env at runtime — the upload phase silently no-ops with the safe
"no creds" branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
String.prototype.replace() interprets $1, $2, $&, etc in the replacement
string as backreferences. When asset alt text contained "$100 million...",
the $1 got replaced with the captured close pattern (\n];\n\n// Computed
stats), corrupting data.ts in 3 places during the first real crawl.

Switching to a function callback avoids any $ substitution. Also expanded
escape() to collapse newlines/tabs/multi-spaces in alt text into single
spaces — defensive against pages with formatted alt attributes.
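
The failure mode is easy to reproduce in isolation. The sketch below uses a simplified stand-in for data.ts (the `file`, `row`, and `closePattern` values are illustrative, not the real emit.ts inputs) and contrasts the string replacement with the callback form.

```typescript
// Simplified stand-in for the tail of data.ts and a new row to splice in.
const file = "rows,\n];";
const row = 'alt: "$100 million"';
const closePattern = /(\n\];)/;

// Buggy: the replacement is a STRING, so the "$1" inside "$100" is treated
// as a reference to capture group 1 and the closing pattern is pasted into the row.
const buggy = file.replace(closePattern, `\n${row}$1`);

// Fixed: the replacement is a FUNCTION; its return value is inserted verbatim,
// with no "$" pattern expansion.
const fixed = file.replace(closePattern, (_match: string, close: string) => `\n${row}${close}`);

// Collapses newlines, tabs, and repeated spaces in alt text into single spaces.
function collapseWhitespace(alt: string): string {
  return alt.replace(/\s+/g, " ").trim();
}

console.log(buggy.includes("$100")); // false: the dollar amount was mangled
console.log(fixed.includes("$100 million")); // true
```

An alternative fix is to escape every `$` in the replacement string as `$$`, but the callback form is harder to get wrong.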

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
abdout and others added 10 commits April 25, 2026 22:12
Reconciles docs/reality gap. Ships persistent captain state, adopts
Anthropic 2026 surface (Routines, Agent Teams, SKILL.md, Plugins,
hook expansion), wires BMAD-style sprint ceremonies, packages 7
plugins for role-based install, embeds <EngineCounts /> for dynamic
docs counts.

Phase 2.5 (Foundation Repair):
- E13 scripts/inventory.sh + docs/INVENTORY.md (single source of truth)
- E14 8 path-scoped rules + 10 missing memory files backfilled
- E15 49 allow + 10 deny permission rules; SessionStart hook wired

Phase 3 (CEO-Brain Captain):
- E16 decision-matrix.yaml (24 rules) + runway.sh + /captain skill
- E17 repositories.json expanded to 14 repos; 4 routine prompts
- E18 8 outreach templates × ar/en; pilot stage tracker
- E19 setup-apple-notes.sh + setup-windows.ps1 + dispatch upgrade

Phase 4 (Engine 2026):
- E20 64 skills migrated to SKILL.md (path-scoping, fork isolation)
- E21 captain + 9 leadership agents get tools/mcpServers/model/memory
- E22 12 hook events wired (was 5)
- E23 8 routines designed (manifest + prompts + setup script)
- E24 Agent Teams enabled, captain Agent Team mode + hogwarts-pilot team
- E25 7 plugins (kun-core + kun-captain + 5 role profiles)

Phase 5 (Methodology):
- E26 4 ceremony skills (/sprint-plan, /standup, /sprint-review, /refine)
       + 4 DoD checklists
- E27 kun/AGENTS.md (Vercel 20-section), .agents/ symlinks, sync watchdog

Phase 6 (Sustainability):
- E28 spend-telemetry.sh Stop hook, caching playbook, auto-throttle.sh,
       cost routing yaml (35 sonnet / 7 haiku / 3 opus distribution)
- E29 <EngineCounts /> MDX component, generate-public-docs.sh,
       in-repo inventory at .claude/memory/kun-inventory.json

Final inventory: 49 agents, 64 skills, 25 MCPs, 12 rules, 14 memory,
13 hooks, 7 plugins, 8 routines.

Source of truth: docs/EPICS-V4.md (130 stories, 6 phases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /dispatch skill referenced an underlying dispatch.sh that didn't
exist. Adds the actual Bash implementation:

- write <channel> "<body>" [priority] [deadline] — Mac (Apple Notes via
  osascript) or Windows (gh issue --label captain)
- read inbox|cowork|captain [n]
- log [n]
- 24h dedupe on identical writes
- Auto-creates Dispatch folder + 3 notes if missing
- Appends [decision|urgent] dispatches to ~/.claude/bridge.md
- Logs every write to ~/.claude/memory/dispatch-log.jsonl for re-dispatch

Updates /dispatch skill + command file to reference scripts/dispatch.sh
(repo path, not user scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- content/docs/captain.mdx surfaces live engine counts via <EngineCounts />
  and links to EPICS-V4 for the v4 roadmap
- Rebuild propagates dispatch.sh path fix into kun-captain skill bundle
- Rebuild propagates v4 settings.json (12 hook events) into kun-core
  hooks/hooks.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w surface

Five skills, one taxonomy, dogfood the issue-first workflow.

/issue — 4 modes:
  /issue                  interactive: type/area/size, create issue
  /issue <text>           one-shot: expand prompt to full issue
  /issue resume <N>       hydrate active-issue.json + dump full thread (the
                          *resume primitive* — pause/resume across sessions)
  /issue list             gh search across all databayt repos
  /issue close            close + summary comment + archive state

/branch — derives <type>/<issue#>-<slug> from active-issue, switches to a
  fresh branch cut from origin/main, updates active-issue.json.branch.

/commit — drafts Conventional Commit from staged diff, attaches Refs #N,
  signs (-S), runs through husky commit-msg hook (commitlint validates).
  Co-Authored-By: Claude Opus 4.7 trailer always present.

/pr — opens draft PR using PR template, fills the Contribution declaration
  block from active-issue (size, author from git log, Co-Authored-By from
  branch trailers, design from figma URL in issue body, etc.). Captures URL
  back into active-issue.json.

/close — closes active issue with summary, archives active-issue.json into
  .claude/state/history/.

Every skill is rooted in .claude/state/active-issue.json. Hooks read it,
skills write it. Issues become the resumable memory channel: any session
can /issue resume <N> and pick up exactly where the prior session ended.

Refs #9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…path

- extract.ts: enqueue Webflow `-p-XXX` originals, exclude i18n/statsig/Mintlify
  favicon-mirror URLs, sanitize JSON-encoded entities (`\u0026`, `&amp;`).
- fetcher.ts: bump Playwright timeout 45s→90s + retry once with networkidle wait
  (rescued anthropic.com/enterprise, claude.com root, agent-sdk/overview pages).
- download.ts: sniff actual format from magic bytes (handles Sanity's
  server-side `?fm=webp` PNG→WebP conversion) and reject HTML 404 splashes
  (Mintlify serves /favicon.ico as 200 OK HTML).
- index.ts: when the manifest already has matching sha256, reuse the existing
  S3 key and emit the data.ts row anyway (was returning silently — root cause
  of the 539-asset orphan gap that backfill-orphans.ts had to clean up).
- Better default names: title-case derived from URL filename when alt is empty.
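
A sketch of the magic-byte check from the download.ts bullet, covering only a few formats. `sniffFormat` and the `Sniffed` union are hypothetical names; the byte signatures themselves are the standard ones for PNG, JPEG, GIF, and RIFF/WebP.

```typescript
type Sniffed = "png" | "jpg" | "gif" | "webp" | "html" | "unknown";

// True when `bytes` contains `sig` starting at `offset`.
function hasSignature(bytes: Uint8Array, sig: number[], offset = 0): boolean {
  return sig.every((b, i) => bytes[offset + i] === b);
}

function ascii(s: string): number[] {
  return Array.from(s, (c) => c.charCodeAt(0));
}

// Identifies the real payload type from its leading bytes, ignoring the URL
// extension and Content-Type header.
function sniffFormat(bytes: Uint8Array): Sniffed {
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return "jpg";
  if (hasSignature(bytes, ascii("GIF8"))) return "gif";
  // WebP is a RIFF container with "WEBP" at byte offset 8.
  if (hasSignature(bytes, ascii("RIFF")) && hasSignature(bytes, ascii("WEBP"), 8)) return "webp";
  // An HTML error page served with 200 OK starts with a doctype or <html> tag.
  const head = String.fromCharCode(...Array.from(bytes.subarray(0, 64))).trimStart().toLowerCase();
  if (head.startsWith("<!doctype html") || head.startsWith("<html")) return "html";
  return "unknown";
}
```

A caller would treat "html" as a failed download and use the sniffed format, not the URL suffix, when choosing the stored file extension.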

Refs the catalog work in 654ffa6 and ee51e53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

abdout commented Apr 26, 2026

Update — orphan gap closed (2026-04-26)

Pushed 9ef9356: hardened the extractor, downloader, and the orphan-emit path that was responsible for the catalog stalling at 174 rows while S3 grew to 713.

Final state (scripts/crawl-anthropic/verify.ts)

| | Before | After |
|---|--------|-------|
| S3 anthropic/ objects | 713 | 1,137 |
| data.ts asset rows | 174 | 1,137 |
| Orphans | 539 | 0 |
| Dangling | 0 | 0 |
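
The two audit counts in this table amount to a pair of set differences. The sketch below assumes an orphan is an S3 key with no data.ts row (consistent with 713 − 174 = 539) and a dangling entry is a row whose key is absent from S3; the second definition and the name `auditKeys` are assumptions, since verify.ts itself is not shown.

```typescript
interface Audit {
  orphans: string[]; // in S3, missing from the catalog
  dangling: string[]; // in the catalog, missing from S3
}

function auditKeys(s3Keys: Iterable<string>, catalogKeys: Iterable<string>): Audit {
  const inS3 = new Set(s3Keys);
  const inCatalog = new Set(catalogKeys);
  return {
    orphans: Array.from(inS3).filter((k) => !inCatalog.has(k)).sort(),
    dangling: Array.from(inCatalog).filter((k) => !inS3.has(k)).sort(),
  };
}
```

The function only reads its inputs, which matches the read-only role described for the audit script.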

What changed

  • extract.ts — enqueue Webflow -p-XXX originals (16-22% larger than responsive variants), exclude i18n/, statsig, Mintlify favicon mirrors, sanitize & and &amp; URL encodings.
  • fetcher.ts — Playwright timeout 45s → 90s + retry once with networkidle wait. Recovered claude.com, anthropic.com/enterprise, agent-sdk/overview, mcp pages.
  • download.ts — magic-byte format sniffing (handles Sanity's ?fm=webp PNG→WebP conversion); reject HTML 404 splashes (Mintlify serves /favicon.ico as 200 OK HTML).
  • index.ts — when manifest[sourceUrl].sha256 === blob.sha256, reuse the existing key and emit the row instead of returning silently. This was the root cause of the 539-asset orphan gap.
  • backfill-orphans.ts (new) — one-shot S3-to-data.ts reconciliation; reads sourceUrl back from HeadObject.Metadata and emits proper rows.
  • verify.ts (new) — read-only audit: reports orphans/dangling per category.
  • data.ts format union extended for gif + webm.
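
The index.ts change in this list can be sketched as a small decision function. All names here (`resolveAsset`, `Manifest`, `Outcome`) are hypothetical; only the comparison `manifest[sourceUrl].sha256 === blob.sha256` comes from the description above.

```typescript
interface ManifestEntry {
  sha256: string;
  key: string;
}
type Manifest = Record<string, ManifestEntry | undefined>;

interface Outcome {
  key: string;
  upload: boolean;
  emitRow: boolean;
}

// Decides what to do with a downloaded blob. `newKey` is the key that would be
// used if the blob has to be uploaded.
function resolveAsset(manifest: Manifest, sourceUrl: string, sha256: string, newKey: string): Outcome {
  const seen = manifest[sourceUrl];
  if (seen && seen.sha256 === sha256) {
    // Already uploaded in an earlier run: reuse the stored key and skip the
    // upload, but still emit the catalog row. Returning without a row at this
    // point is what left S3 objects without data.ts entries.
    return { key: seen.key, upload: false, emitRow: true };
  }
  return { key: newKey, upload: true, emitRow: true };
}
```

Both branches set `emitRow` to true, which keeps the catalog in step with S3 regardless of whether the upload is skipped.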

Remaining

  • 3 page failures: trust-center, security, legal/subprocessors — Cloudflare bot challenge, would need residential proxy.
  • ~233 categorizations marked "needs review" by the auto-categorizer; cosmetic, can be hand-tuned later.

abdout and others added 11 commits April 26, 2026 14:22
.claude/rules/github-workflow.md — rewrite to match the new flow:
  - Add `paths:` frontmatter (already present)
  - Update branch examples to <type>/<issue#>-<slug> (e.g. feat/9-...)
  - Replace label namespace: type:feat → type/feat, P0 → priority/p0,
    add status/* labels managed by auto-status.yml
  - Co-Authored-By trailer: Claude Opus 4.6 → Claude Opus 4.7 (matches AGENTS.md)
  - Add Contribution declaration section under PR step
  - Add Hooks-that-fire-automatically table (auto-issue, post-commit, etc.)
  - Add CI workflows table (pr-check, signed-commits, contribution-declaration)
  - Add size estimate locked at status/ready as the CU multiplier

CONTRIBUTING.md (new) — top-level contributor doc:
  - License + CLA agreement (SSPL-1.0 + commercial grant)
  - Issue templates table
  - Branch naming + Conventional Commits + signed commits setup
  - PR template + Contribution declaration block
  - Sharing-economy revenue model summary (CU math, monthly distribution,
    7-day dispute window, anti-gaming)
  - Step-by-step "what you actually do" walkthrough
  - Anti-patterns table

.gitignore — add .claude/state/ so per-session active-issue.json,
auto-issue-counter.json, history/ stay local. (.claude/scheduled_tasks.lock
also ignored.)

Refs #9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Turbopack's MDX loader was parsing the `<` as an opening JSX tag and
choking on the digit `5` (not a valid identifier start). Wrap in
backticks so it renders as inline code instead. Unblocks Vercel builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfig script

Phase B — revenue ledger seed (under scratch/revenue-seed/, will move to a new
public repo databayt/revenue in a follow-up PR):

- RULES.md v1.0 — Contribution Unit (CU) table, distribution policy with
  TBD reserve % (locks in a separate small lock-PR before first cash
  distribution), 7 anti-gaming locks (story-point freeze, self-sponsor
  halving, monthly cap 100 CU/person, substantive-review threshold,
  PR-author guard, false-declaration penalty, closed-as-not-planned 0 CU).
- README.md — public-facing repo description.
- .github/workflows/contrib-tally.yml — nightly cron at 02:00 UTC,
  aggregates CU across all 13 databayt repos via gh api graphql, writes
  signed JSON snapshot to .snapshot/<date>.json.
- .github/workflows/contrib-monthly.yml — 1st of month cron at 03:00 UTC,
  generates reports/monthly-report-<YYYY-MM>.md, opens tracking issue.
- .github/ISSUE_TEMPLATE/dispute.yml — public dispute form (kind, artifact
  link, period, claim, evidence, proposed fix, acknowledgement).
- scripts/contribution-report.sh — runnable from kun: queries gh api for
  closed issues + merged PRs across all live repos in the rolling window,
  computes minimal headline CU per assignee, writes leaderboard JSON.
  Full math (review-substance parsing, pair declaration, monthly cap,
  reserves) lives in databayt/revenue's tally.mjs once that repo exists.

Phase C — replication infrastructure for the unified .github/ kit:

- .claude/memory/area-dropdowns.json — per-repo `area` dropdown options.
  scripts/replicate-github-config.sh reads this when copying templates so
  each repo's 1-feat.yml + 2-fix.yml get the right dropdown without manual
  editing. 13 repo entries: kun, codebase, hogwarts, souq, mkan, shifa,
  marketing, swift-app, shadcn, radix, apple, distributed-computer, .github.
- scripts/replicate-github-config.sh — reads repositories.json, iterates
  live-status repos, copies the .github/ kit + commitlint + husky +
  lint-staged + per-repo area dropdown injection, opens a PR per repo
  titled "chore(workflow): adopt unified databayt github config".
  Hogwarts is special-cased (already has 80% — keeps its existing
  pr-check.yml). Supports --repos x,y / --dry-run / --delay 24h.

After this PR merges, run `bash scripts/replicate-github-config.sh --dry-run`
first to preview, then drop --dry-run to open 12 follow-up PRs across the
org. Cherry-pick or stagger as needed.

Refs #9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>