feat: anthropic asset crawler — fetch + categorize + upload to s3#7
Open
abdout wants to merge 34 commits into
One-button `pnpm crawl:anthropic` re-crawls all Anthropic web surfaces (anthropic.com, claude.ai, claude.com, docs.claude.com, support.anthropic.com, github.com/anthropics), extracts every img/picture/video/icon/Lottie/font/PDF URL, downloads new ones, uploads to s3://hogwarts-databayt/anthropic/<cat>/, appends rows to src/components/root/anthropic/data.ts, and bumps LAST_CRAWLED.
- Hybrid fetch: undici for static pages, Playwright for dynamic (/pricing, /claude/*, /product/*, /features/*, all docs)
- Idempotent: skip on sourceUrl match in data.ts, sha256 manifest as resume log, HeadObject precheck before PutObject; no CloudFront invalidation needed
- Decision-tree categorization across 16 categories with deterministic ranking
- Slugs derived from img alt → caption → page URL → sha256 prefix; collisions resolved with 4-char hash suffix
- Honors robots.txt, custom UA, 2 req/s per host, exponential backoff on 429/503
Dry run: 122/135 pages crawled, 468 new candidates after dedup against existing 169 assets, 456 successfully downloaded, categorization 80%+ confident.
setup-cname.sh documents the additive cdn.databayt.org alias setup; the existing d1dlwtcfl0db67.cloudfront.net keeps working unchanged for hogwarts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
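The slug fallback chain in this commit message (img alt → caption → page URL → sha256 prefix, with a 4-char hash suffix on collision) could look roughly like the sketch below. The `SlugSource` shape, the function names, and the 12-character fallback prefix are assumptions for illustration; the crawler's real code is not shown in this PR.

```typescript
// Hypothetical input shape; field names are assumptions, not the crawler's types.
interface SlugSource {
  alt?: string;
  caption?: string;
  pageUrl?: string;
  sha256: string; // hex digest of the downloaded bytes
}

// Lowercase, turn runs of non-alphanumerics into "-", trim edge dashes.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Last non-empty path segment of the page URL, or "" if there is none.
function lastPathSegment(pageUrl?: string): string {
  if (!pageUrl) return "";
  const parts = new URL(pageUrl).pathname.split("/").filter(Boolean);
  return parts[parts.length - 1] ?? "";
}

function deriveSlug(src: SlugSource, taken: Set<string>): string {
  const candidates = [src.alt, src.caption, lastPathSegment(src.pageUrl)];
  let slug = "";
  for (const candidate of candidates) {
    slug = slugify(candidate ?? "");
    if (slug) break;
  }
  // Nothing human-readable available: fall back to a hash prefix (length assumed).
  if (!slug) slug = src.sha256.slice(0, 12);
  // Collision: append the first 4 hex chars of the content hash.
  if (taken.has(slug)) slug = `${slug}-${src.sha256.slice(0, 4)}`;
  taken.add(slug);
  return slug;
}
```

Passing the set of already-used slugs in explicitly keeps the function deterministic for a given crawl order, which fits the "deterministic ranking" goal stated above.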
15 tasks
cdn.databayt.org is already in use by another Vercel project (returns 307 redirect to /ar with a NEXT_LOCALE cookie), so we'd break it by reassigning. assets.databayt.org is free and more semantic for a static-asset CDN.
DNS already provisioned on Vercel via the vercel CLI:
`vercel dns add databayt.org assets CNAME d1dlwtcfl0db67.cloudfront.net`
Also added a CAA record allowing amazon.com (the existing CAA whitelist only had pki.goog/sectigo.com/letsencrypt.org, which would have failed ACM cert validation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Freed up cdn.databayt.org from the wildcard *.databayt.org → hogwarts routing by adding a specific CNAME via the Vercel CLI:
`vercel dns add databayt.org cdn CNAME d1dlwtcfl0db67.cloudfront.net`
A specific record overrides the wildcard, so cdn.databayt.org now resolves to CloudFront IPs (52.85.32.x). Hogwarts continues serving everything else under *.databayt.org untouched.
Spare CNAME at assets.databayt.org left in place: additive, harmless, and a convenient backup alias.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Node 20.6+ supports --env-file natively (no dotenv dep needed). Without this, the crawler can't read AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from .env at runtime — the upload phase silently no-ops with the safe "no creds" branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
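One way to keep the "no creds" branch from being silent is to return an explicit reason that the orchestrator can log. This is a minimal sketch, assuming the env variable names mentioned in this PR; `planUpload` and `UploadPlan` are hypothetical names, not the crawler's actual API.

```typescript
// Minimal env shape so the sketch does not depend on Node typings.
type Env = Record<string, string | undefined>;

type UploadPlan =
  | { upload: true }
  | { upload: false; reason: string };

// Decide whether the upload phase should run, and say why not if it should not.
function planUpload(env: Env): UploadPlan {
  const hasKeyPair =
    Boolean(env["AWS_ACCESS_KEY_ID"]) && Boolean(env["AWS_SECRET_ACCESS_KEY"]);
  const hasProfile = Boolean(env["AWS_PROFILE"]);
  if (hasKeyPair || hasProfile) return { upload: true };
  return {
    upload: false,
    reason: "no AWS credentials in env; skipping upload and leaving data.ts untouched",
  };
}
```

With Node 20.6+, variables from `.env` reach the process when it is started with `node --env-file=.env`, so a call such as `planUpload(process.env)` would see them without a dotenv dependency.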
String.prototype.replace() interprets $1, $2, $&, etc in the replacement string as backreferences. When asset alt text contained "$100 million...", the $1 got replaced with the captured close pattern (\n];\n\n// Computed stats), corrupting data.ts in 3 places during the first real crawl. Switching to a function callback avoids any $ substitution. Also expanded escape() to collapse newlines/tabs/multi-spaces in alt text into single spaces — defensive against pages with formatted alt attributes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
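The failure mode described here is easy to reproduce in isolation. The sketch below uses a stand-in `dataTs` string and a hypothetical `collapseWhitespace` helper (the PR's `escape()` is not shown); only the `String.prototype.replace` behavior itself is standard.

```typescript
// Stand-in for the tail of data.ts; the real file is much larger.
const closePattern = /(\n\];\n\n\/\/ Computed stats)/;
const dataTs =
  'export const assets = [\n  { alt: "existing" },\n];\n\n// Computed stats';
const newRow = '  { alt: "$100 million commitment" },';

// Buggy: in a replacement STRING, "$1" means capture group 1, so the "$1"
// inside "$100" is expanded as well, not only the intended trailing "$1".
const corrupted = dataTs.replace(closePattern, `\n${newRow}$1`);

// Fixed: a replacement FUNCTION's return value is inserted verbatim,
// so "$" sequences in the row text are never interpreted.
const safe = dataTs.replace(closePattern, (close: string) => `\n${newRow}${close}`);

// Hypothetical helper mirroring the alt-text cleanup described above:
// collapse newlines, tabs, and repeated spaces into single spaces.
function collapseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}
```

An alternative fix is to escape each `$` as `$$` before building the replacement string, but the callback form avoids having to remember that at every call site.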
Reconciles docs/reality gap. Ships persistent captain state, adopts
Anthropic 2026 surface (Routines, Agent Teams, SKILL.md, Plugins,
hook expansion), wires BMAD-style sprint ceremonies, packages 7
plugins for role-based install, embeds <EngineCounts /> for dynamic
docs counts.
Phase 2.5 (Foundation Repair):
- E13 scripts/inventory.sh + docs/INVENTORY.md (single source of truth)
- E14 8 path-scoped rules + 10 missing memory files backfilled
- E15 49 allow + 10 deny permission rules; SessionStart hook wired
Phase 3 (CEO-Brain Captain):
- E16 decision-matrix.yaml (24 rules) + runway.sh + /captain skill
- E17 repositories.json expanded to 14 repos; 4 routine prompts
- E18 8 outreach templates × ar/en; pilot stage tracker
- E19 setup-apple-notes.sh + setup-windows.ps1 + dispatch upgrade
Phase 4 (Engine 2026):
- E20 64 skills migrated to SKILL.md (path-scoping, fork isolation)
- E21 captain + 9 leadership agents get tools/mcpServers/model/memory
- E22 12 hook events wired (was 5)
- E23 8 routines designed (manifest + prompts + setup script)
- E24 Agent Teams enabled, captain Agent Team mode + hogwarts-pilot team
- E25 7 plugins (kun-core + kun-captain + 5 role profiles)
Phase 5 (Methodology):
- E26 4 ceremony skills (/sprint-plan, /standup, /sprint-review, /refine)
+ 4 DoD checklists
- E27 kun/AGENTS.md (Vercel 20-section), .agents/ symlinks, sync watchdog
Phase 6 (Sustainability):
- E28 spend-telemetry.sh Stop hook, caching playbook, auto-throttle.sh,
cost routing yaml (35 sonnet / 7 haiku / 3 opus distribution)
- E29 <EngineCounts /> MDX component, generate-public-docs.sh,
in-repo inventory at .claude/memory/kun-inventory.json
Final inventory: 49 agents, 64 skills, 25 MCPs, 12 rules, 14 memory,
13 hooks, 7 plugins, 8 routines.
Source of truth: docs/EPICS-V4.md (130 stories, 6 phases).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /dispatch skill referenced an underlying dispatch.sh that didn't exist. Adds the actual Bash implementation:
- write <channel> "<body>" [priority] [deadline]: Mac (Apple Notes via osascript) or Windows (gh issue --label captain)
- read inbox|cowork|captain [n]
- log [n]
- 24h dedupe on identical writes
- Auto-creates Dispatch folder + 3 notes if missing
- Appends [decision|urgent] dispatches to ~/.claude/bridge.md
- Logs every write to ~/.claude/memory/dispatch-log.jsonl for re-dispatch
Updates /dispatch skill + command file to reference scripts/dispatch.sh (repo path, not user scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- content/docs/captain.mdx surfaces live engine counts via <EngineCounts /> and links to EPICS-V4 for the v4 roadmap
- Rebuild propagates dispatch.sh path fix into kun-captain skill bundle
- Rebuild propagates v4 settings.json (12 hook events) into kun-core hooks/hooks.json
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w surface
Five skills, one taxonomy, dogfood the issue-first workflow.
/issue — 4 modes:
/issue interactive: type/area/size, create issue
/issue <text> one-shot: expand prompt to full issue
/issue resume <N> hydrate active-issue.json + dump full thread (the
*resume primitive* — pause/resume across sessions)
/issue list gh search across all databayt repos
/issue close close + summary comment + archive state
/branch — derives <type>/<issue#>-<slug> from active-issue, switches to a
fresh branch cut from origin/main, updates active-issue.json.branch.
/commit — drafts Conventional Commit from staged diff, attaches Refs #N,
signs (-S), runs through husky commit-msg hook (commitlint validates).
Co-Authored-By: Claude Opus 4.7 trailer always present.
/pr — opens draft PR using PR template, fills the Contribution declaration
block from active-issue (size, author from git log, Co-Authored-By from
branch trailers, design from figma URL in issue body, etc.). Captures URL
back into active-issue.json.
/close — closes active issue with summary, archives active-issue.json into
.claude/state/history/.
Every skill is rooted in .claude/state/active-issue.json. Hooks read it,
skills write it. Issues become the resumable memory channel: any session
can /issue resume <N> and pick up exactly where the prior session ended.
Refs #9
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
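The `<type>/<issue#>-<slug>` naming that /branch derives from the active issue can be sketched as a pure function. The `ActiveIssue` fields and the 40-character slug limit are assumptions; the real schema of active-issue.json is not shown in this commit.

```typescript
// Assumed subset of .claude/state/active-issue.json; field names are hypothetical.
interface ActiveIssue {
  number: number;
  type: string; // e.g. "feat", "fix", "chore"
  title: string;
}

// Build "<type>/<issue#>-<slug>" from the issue title.
function branchName(issue: ActiveIssue, maxSlugLength = 40): string {
  const slug = issue.title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // non-alphanumerics become dashes
    .replace(/^-+|-+$/g, "") // trim edge dashes
    .slice(0, maxSlugLength)
    .replace(/-+$/, ""); // truncation may leave a trailing dash
  return `${issue.type}/${issue.number}-${slug}`;
}
```

For an issue numbered 9 of type `feat` this yields names of the form `feat/9-...`, matching the branch examples used elsewhere in this PR.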
…path
- extract.ts: enqueue Webflow `-p-XXX` originals, exclude i18n/statsig/Mintlify favicon-mirror URLs, sanitize JSON-encoded entities (`&`, `&`).
- fetcher.ts: bump Playwright timeout 45s→90s + retry once with networkidle wait (rescued anthropic.com/enterprise, claude.com root, agent-sdk/overview pages).
- download.ts: sniff actual format from magic bytes (handles Sanity's server-side `?fm=webp` PNG→WebP conversion) and reject HTML 404 splashes (Mintlify serves /favicon.ico as 200 OK HTML).
- index.ts: when the manifest already has matching sha256, reuse the existing S3 key and emit the data.ts row anyway (was returning silently; root cause of the 539-asset orphan gap that backfill-orphans.ts had to clean up).
- Better default names: title-case derived from URL filename when alt is empty.
Refs the catalog work in 654ffa6 and ee51e53.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
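Magic-byte sniffing of the kind download.ts is described as doing can be sketched as follows. The signatures for PNG, JPEG, GIF, and WebP are the standard ones; the function names, the `Sniffed` union, and the 256-byte text window are assumptions, not the PR's implementation.

```typescript
type Sniffed = "png" | "jpeg" | "gif" | "webp" | "svg" | "html" | "unknown";

// True if `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  return signature.every((b, i) => bytes[offset + i] === b);
}

// Trust the payload, not the URL extension or Content-Type header.
function sniffFormat(bytes: Uint8Array): Sniffed {
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return "jpeg";
  if (hasSignature(bytes, [0x47, 0x49, 0x46, 0x38])) return "gif"; // "GIF8"
  // WebP is a RIFF container: "RIFF" at byte 0 and "WEBP" at byte 8.
  if (
    hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return "webp";
  }
  // Text formats: inspect the first bytes as UTF-8 (window size is an assumption).
  const head = new TextDecoder().decode(bytes.slice(0, 256)).trimStart().toLowerCase();
  if (head.startsWith("<!doctype html") || head.startsWith("<html")) return "html";
  if (head.startsWith("<svg") || (head.startsWith("<?xml") && head.includes("<svg"))) return "svg";
  return "unknown";
}
```

A caller could then reject any download that sniffs as `"html"` (the 200 OK error-page case) and record the sniffed format instead of the extension when a `.png` URL actually returns WebP bytes.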
Update: orphan gap closed (2026-04-26)
Pushed Final state (
| | Before | After |
|---|---|---|
| S3 anthropic/ objects | 713 | 1,137 |
| data.ts asset rows | 174 | 1,137 |
| Orphans | 539 | 0 |
| Dangling | 0 | 0 |
What changed
- `extract.ts`: enqueue Webflow `-p-XXX` originals (16-22% larger than responsive variants), exclude `i18n/`, `statsig`, Mintlify favicon mirrors, sanitize `&` and `&` URL encodings.
- `fetcher.ts`: Playwright timeout 45s → 90s + retry once with `networkidle` wait. Recovered `claude.com`, `anthropic.com/enterprise`, `agent-sdk/overview`, `mcp` pages.
- `download.ts`: magic-byte format sniffing (handles Sanity's `?fm=webp` PNG→WebP conversion); reject HTML 404 splashes (Mintlify serves `/favicon.ico` as 200 OK HTML).
- `index.ts`: when `manifest[sourceUrl].sha256 === blob.sha256`, reuse the existing key and emit the row instead of returning silently. This was the root cause of the 539-asset orphan gap.
- `backfill-orphans.ts` (new): one-shot S3-to-data.ts reconciliation; reads sourceUrl back from `HeadObject.Metadata` and emits proper rows.
- `verify.ts` (new): read-only audit that reports orphans/dangling per category.
- `data.ts` format union extended for `gif` + `webm`.
Remaining
- 3 page failures: `trust-center`, `security`, `legal/subprocessors`. Cloudflare bot challenge, would need residential proxy.
- ~233 categorizations marked "needs review" by the auto-categorizer; cosmetic, can be hand-tuned later.
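The `index.ts` fix and the `verify.ts` audit described in this update can be sketched as two small pure functions. The manifest shape (`sourceUrl → { sha256, key }`) comes from the PR description; the function names, the `Plan` type, and the reading of "orphan" as an S3 object without a data.ts row and "dangling" as a row without an S3 object are assumptions inferred from the table above.

```typescript
interface ManifestEntry {
  sha256: string;
  key: string;
}
type Manifest = Record<string, ManifestEntry>;

interface Plan {
  key: string; // S3 key the data.ts row should point at
  upload: boolean; // whether bytes still need to be uploaded
  emitRow: boolean; // whether a data.ts row must be written
}

// Unchanged content reuses the existing key but STILL emits a row.
// Returning early without a row is what produced orphaned S3 objects.
function planAsset(manifest: Manifest, sourceUrl: string, sha256: string, freshKey: string): Plan {
  const prior = manifest[sourceUrl];
  if (prior && prior.sha256 === sha256) {
    return { key: prior.key, upload: false, emitRow: true };
  }
  return { key: freshKey, upload: true, emitRow: true };
}

// Read-only audit over two key lists (assumed definitions, see lead-in).
function audit(s3Keys: string[], rowKeys: string[]): { orphans: string[]; dangling: string[] } {
  const inS3 = new Set(s3Keys);
  const inRows = new Set(rowKeys);
  return {
    orphans: s3Keys.filter((key) => !inRows.has(key)),
    dangling: rowKeys.filter((key) => !inS3.has(key)),
  };
}
```

Under these definitions the "Before" column is consistent: 713 objects minus 174 rows leaves 539 orphans when no row is dangling.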
.claude/rules/github-workflow.md — rewrite to match the new flow:
- Add `paths:` frontmatter (already present)
- Update branch examples to <type>/<issue#>-<slug> (e.g. feat/9-...)
- Replace label namespace: type:feat → type/feat, P0 → priority/p0,
add status/* labels managed by auto-status.yml
- Co-Authored-By trailer: Claude Opus 4.6 → Claude Opus 4.7 (matches AGENTS.md)
- Add Contribution declaration section under PR step
- Add Hooks-that-fire-automatically table (auto-issue, post-commit, etc.)
- Add CI workflows table (pr-check, signed-commits, contribution-declaration)
- Add size estimate locked at status/ready as the CU multiplier
CONTRIBUTING.md (new) — top-level contributor doc:
- License + CLA agreement (SSPL-1.0 + commercial grant)
- Issue templates table
- Branch naming + Conventional Commits + signed commits setup
- PR template + Contribution declaration block
- Sharing-economy revenue model summary (CU math, monthly distribution,
7-day dispute window, anti-gaming)
- Step-by-step "what you actually do" walkthrough
- Anti-patterns table
.gitignore — add .claude/state/ so per-session active-issue.json,
auto-issue-counter.json, history/ stay local. (.claude/scheduled_tasks.lock
also ignored.)
Refs #9
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Turbopack's MDX loader was parsing the `<` as an opening JSX tag and choking on the digit `5` (not a valid identifier start). Wrap in backticks so it renders as inline code instead. Unblocks Vercel builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfig script
Phase B: revenue ledger seed (under scratch/revenue-seed/, will move to a new public repo databayt/revenue in a follow-up PR):
- RULES.md v1.0: Contribution Unit (CU) table, distribution policy with TBD reserve % (locks in a separate small lock-PR before first cash distribution), 7 anti-gaming locks (story-point freeze, self-sponsor halving, monthly cap 100 CU/person, substantive-review threshold, PR-author guard, false-declaration penalty, closed-as-not-planned 0 CU).
- README.md: public-facing repo description.
- .github/workflows/contrib-tally.yml: nightly cron at 02:00 UTC, aggregates CU across all 13 databayt repos via gh api graphql, writes signed JSON snapshot to .snapshot/<date>.json.
- .github/workflows/contrib-monthly.yml: 1st of month cron at 03:00 UTC, generates reports/monthly-report-<YYYY-MM>.md, opens tracking issue.
- .github/ISSUE_TEMPLATE/dispute.yml: public dispute form (kind, artifact link, period, claim, evidence, proposed fix, acknowledgement).
- scripts/contribution-report.sh: runnable from kun; queries gh api for closed issues + merged PRs across all live repos in the rolling window, computes minimal headline CU per assignee, writes leaderboard JSON. Full math (review-substance parsing, pair declaration, monthly cap, reserves) lives in databayt/revenue's tally.mjs once that repo exists.
Phase C: replication infrastructure for the unified .github/ kit:
- .claude/memory/area-dropdowns.json: per-repo `area` dropdown options. scripts/replicate-github-config.sh reads this when copying templates so each repo's 1-feat.yml + 2-fix.yml get the right dropdown without manual editing. 13 repo entries: kun, codebase, hogwarts, souq, mkan, shifa, marketing, swift-app, shadcn, radix, apple, distributed-computer, .github.
- scripts/replicate-github-config.sh: reads repositories.json, iterates live-status repos, copies the .github/ kit + commitlint + husky + lint-staged + per-repo area dropdown injection, opens a PR per repo titled "chore(workflow): adopt unified databayt github config". Hogwarts is special-cased (already has 80%, keeps its existing pr-check.yml). Supports --repos x,y / --dry-run / --delay 24h.
After this PR merges, run `bash scripts/replicate-github-config.sh --dry-run` first to preview, then drop --dry-run to open 12 follow-up PRs across the org. Cherry-pick or stagger as needed.
Refs #9
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
One-button `pnpm crawl:anthropic` re-crawls Anthropic web properties and mirrors all new assets to our existing S3 + CloudFront. Replaces the manual WebFetch loop. Idempotent, safe to run weekly.
What changed
- `scripts/crawl-anthropic/`: 11 TS modules + `setup-cname.sh`
  - `index.ts` orchestrator with `--dry-run`, `--upload-only`, `--skip-build`
  - `fetcher.ts`: undici for static, Playwright for dynamic pages
  - `extract.ts` mines img / picture / video / link / meta / inline-css / Lottie URLs
  - `categorize.ts` 18-rule decision tree across 16 categories
  - `download.ts` streams + sha256 + dimensions (sharp for raster, viewBox for SVG, Lottie w/h for JSON)
  - `upload.ts` HeadObject precheck + PutObject with `Cache-Control: immutable` and `Metadata: { sourceUrl, sha256 }`
  - `diff.ts` reads existing `assets[]` to skip already-mirrored URLs
  - `emit.ts` splices new rows into `data.ts` and bumps `LAST_CRAWLED`
- `package.json`: devDeps + 3 scripts (`crawl:anthropic`, `crawl:anthropic:dry`, `crawl:anthropic:upload-only`)
- `.gitignore`: exclude crawler state dir
- `.claude/commands/crawl-anthropic.md`: rewritten to point at the new script
Coverage (per user selection)
- `anthropic.com`: full marketing surface (47 existing pages + new: pricing, customers, enterprise, trust, support, news pagination, research papers)
- `claude.ai` + `claude.com`: public marketing (auth-walled app routes skipped)
- `docs.claude.com`: Claude Code, Agent SDK, API, MCP, models, prompt-engineering, tool-use, prompt-caching, release-notes
- `support.anthropic.com`: articles, depth 1
- `github.com/anthropics`: public repos and social previews
Dry-run results
- … `networkidle` to `domcontentloaded` + 2s settle)
- … `--needs-review` (low-confidence categorization)
Slugs are derived from `<img alt>` so they're human-readable, e.g.
- `anthropic/illustrations/an-update-on-our-election-safeguards.svg`
- `anthropic/illustrations/introducing-claude-design-by-anthropic-labs.svg`
Idempotency
- … `data.ts` are matched by `sourceUrl` and skipped, never re-uploaded
- `state/manifest.json` is the resume log: maps `sourceUrl → { sha256, key }`. Crash-safe
- `HeadObject` checks the matching sha256 metadata; skip if already there
CNAME setup (user action required)
`scripts/crawl-anthropic/setup-cname.sh` documents the 4 manual AWS steps to add `cdn.databayt.org` as an alias to the existing CloudFront distribution. Adding the alias is purely additive: the original `d1dlwtcfl0db67.cloudfront.net` URL keeps working forever, so all hardcoded refs in `hogwarts/` (next.config.ts allowed-hostname, `WA_CDN_BASE`, `WA_CHAT_BG`, `NEXT_PUBLIC_CDN_DOMAIN`, `cdn-manifest.json`) stay healthy.
1. `cdn.databayt.org` → `d1dlwtcfl0db67.cloudfront.net`
2. … `cdn.databayt.org`, validate via DNS
3. … `cdn.databayt.org` as alternate domain name + attach cert
4. `curl -sI https://cdn.databayt.org/anthropic/brand/claude-wordmark.svg` returns 200
After step 4 passes, a one-line follow-up PR flips `CDN_BASE` in `data.ts` to `https://cdn.databayt.org`. Hogwarts continues using the old URL; both resolve to the same distribution.
AWS credentials
The dry run did not need AWS credentials. The full run reads `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` (or `AWS_PROFILE`) from env. Without credentials the script downloads + categorizes locally but skips the upload phase and does NOT mutate `data.ts` (so the page never points at non-existent S3 keys).
Test plan
- `npx tsc --noEmit`: clean
- `pnpm build`: clean
- `pnpm crawl:anthropic:dry`: 122/135 pages, report.md generated, no fatal errors
- `pnpm crawl:anthropic`: run after AWS creds + CNAME setup
- `localhost:3000/en/anthropic` post-run: spot-check 3 new assets render
Trademark
Anthropic's trademark guidelines restrict redistribution of brand assets. The `/anthropic` page is an internal showcase only.
🤖 Generated with Claude Code
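The HeadObject precheck described under Idempotency can be reduced to a pure decision plus a parameter builder, sketched below without an SDK dependency. `HeadResult`, `needsUpload`, and `putParams` are hypothetical names, and the full `Cache-Control` value is an assumption (the summary only says `immutable`); the `Bucket`, `Key`, `CacheControl`, and `Metadata` field names follow the S3 PutObject input shape.

```typescript
// Minimal stand-in for what a HeadObject call returns; null means "no such key".
interface HeadResult {
  metadata?: Record<string, string>;
}

// Upload only when the object is missing or its recorded sha256 differs.
function needsUpload(head: HeadResult | null, sha256: string): boolean {
  return head?.metadata?.["sha256"] !== sha256;
}

// Parameters for the PutObject call; the bucket name comes from the commit message.
function putParams(key: string, sourceUrl: string, sha256: string) {
  return {
    Bucket: "hogwarts-databayt",
    Key: key,
    // Assumed value: content-addressed assets never change under the same key.
    CacheControl: "public, max-age=31536000, immutable",
    // S3 returns user metadata keys lowercased, so reads should expect "sourceurl".
    Metadata: { sourceUrl, sha256 },
  };
}
```

Keeping the decision separate from the network call makes the skip rule testable without AWS credentials, in the same spirit as the dry run.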