A generalized Webflow site archiver with a UI dashboard. Crawls any Webflow site and produces a static, self-contained snapshot.
- Multi-site support: Manage multiple Webflow sites
- Real-time progress: Watch crawls as they run via Server-Sent Events (SSE)
- Scheduled crawls: Set up cron-based schedules for automatic archiving
- Asset localization: Downloads and rewrites all JS, CSS, images, fonts
- Webflow-specific: Removes badges, normalizes lazy-loaded media
- Download & Preview: Download archives as zip or preview in-browser
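The badge-stripping and lazy-media normalization can be sketched as a pure string transform. This is a simplified illustration only (the actual scraper parses HTML with Cheerio), and the `w-webflow-badge` class name and `data-src` attribute are assumptions about typical Webflow markup:

```typescript
// Simplified sketch of Webflow-specific HTML cleanup (illustrative, not the
// project's actual implementation, which uses a real HTML parser).
function normalizeWebflowHtml(html: string): string {
  // Strip the Webflow attribution badge (assumed to be an <a> with this class).
  let out = html.replace(
    /<a[^>]*class="[^"]*w-webflow-badge[^"]*"[\s\S]*?<\/a>/g,
    ""
  );
  // Promote lazy-loaded image URLs so images render without Webflow's JS.
  out = out.replace(
    /<img([^>]*?)\bdata-src="([^"]+)"([^>]*)>/g,
    (_m: string, pre: string, src: string, post: string) =>
      `<img${pre}src="${src}"${post}>`
  );
  return out;
}
```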
```
dxd-webflow-scraper/
├── apps/
│   ├── web/          # Vite + TanStack Router dashboard
│   └── api/          # Hono API server
├── packages/
│   └── scraper/      # Core scraping logic
├── services/
│   └── worker/       # Background job processor
```
- Current architecture reference: ARCHITECTURE_CURRENT.md
- Historical design notes: ARCHITECTURE.md
- Frontend: Vite, React, TanStack Router, TanStack Query, Tailwind CSS
- Backend: Hono, Drizzle ORM, PostgreSQL
- Worker: BullMQ, Redis, Playwright
- Scraper: Playwright, Cheerio
- Bun >= 1.3.6
- Docker (for PostgreSQL and Redis)
- Playwright browsers
- Clone and install dependencies:

  ```
  bun install
  ```

- Start PostgreSQL and Redis:

  ```
  docker-compose up -d
  ```

- Set up environment variables:

  ```
  cp .env.example .env
  # Edit .env with your settings
  ```

- Push database schema:

  ```
  cd apps/api && bun run db:push
  ```

- Install Playwright browsers:

  ```
  bunx playwright install chromium
  ```

- Configure GitHub OAuth (required for dashboard login):
  - Create a GitHub OAuth App in your org/account settings.
  - Set `Authorization callback URL` to:
    - Local: `http://localhost:3001/api/auth/callback/github`
    - Production: `https://<your-api-domain>/api/auth/callback/github`
- Add these env vars for the API service:
  - `GITHUB_CLIENT_ID`
  - `GITHUB_CLIENT_SECRET`
  - `AUTH_SECRET` (random high-entropy string)
  - `FRONTEND_URL` (for redirects, e.g. `http://localhost:5173` or your web app domain)
- Add this env var for the web app:
  - `VITE_API_URL` (e.g. `http://localhost:3001` locally)
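Since a missing OAuth variable typically only surfaces at login time, it can help to validate the list above at startup. A minimal sketch (the helper name and the check itself are illustrative, not part of the project):

```typescript
// Names taken from the env vars listed above; the check is illustrative.
const REQUIRED_API_ENV = [
  "GITHUB_CLIENT_ID",
  "GITHUB_CLIENT_SECRET",
  "AUTH_SECRET",
  "FRONTEND_URL",
] as const;

// Return every missing variable name so all problems are reported at once.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_API_ENV.filter((name) => !env[name]);
}
```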
Run all services in parallel:

```
# Terminal 1: API server
cd apps/api && bun run dev

# Terminal 2: Web dashboard
cd apps/web && bun run dev

# Terminal 3: Background worker
cd services/worker && bun run dev
```

Or use Turbo:

```
bun run dev
```

Open http://localhost:5173 for the dashboard.
- Add a site: Go to Sites → Add Site, enter a Webflow URL
- Start a crawl: Click "Start Crawl" on any site
- Monitor progress: Watch real-time logs on the crawl detail page
- Download: Once complete, download the archive or preview in-browser
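The same flow is available over the HTTP API; for example, a crawl can be triggered programmatically against the `POST /api/sites/:id/crawl` endpoint. A sketch, assuming the local API base URL from the setup above (the helper is illustrative):

```typescript
// Build the request that starts a crawl for a site. The endpoint comes from
// the API table; the helper function itself is illustrative.
function crawlRequest(apiBase: string, siteId: string): Request {
  return new Request(
    `${apiBase}/api/sites/${encodeURIComponent(siteId)}/crawl`,
    { method: "POST" }
  );
}

// Usage (would actually start a crawl against a running API):
// const res = await fetch(crawlRequest("http://localhost:3001", "my-site-id"));
```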
- Concurrency: Number of pages to crawl in parallel (1-30)
- Max Pages: Limit total pages (useful for testing)
- Exclude Patterns: Regex patterns to skip certain URLs
- Remove Webflow Badge: Strip the Webflow attribution badge
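Taken together, these options might map onto a config shape like the following. The field names are assumptions for illustration, not the project's actual schema:

```typescript
// Hypothetical crawl-options shape mirroring the settings above.
interface CrawlOptions {
  concurrency: number;        // pages crawled in parallel (1-30)
  maxPages?: number;          // optional cap on total pages (handy for testing)
  excludePatterns: string[];  // regex source strings; matching URLs are skipped
  removeWebflowBadge: boolean;
}

// Example of how exclude patterns would be applied to a candidate URL.
function shouldSkip(url: string, opts: CrawlOptions): boolean {
  return opts.excludePatterns.some((p) => new RegExp(p).test(url));
}
```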
- Local: Store archives on the server filesystem
- S3/R2: Store in S3-compatible storage (Cloudflare R2, AWS S3, etc.)
| Endpoint | Description |
|---|---|
| `GET /api/sites` | List all sites |
| `POST /api/sites` | Create a site |
| `POST /api/sites/:id/crawl` | Start a crawl |
| `GET /api/crawls` | List crawls |
| `GET /api/crawls/:id` | Get crawl details |
| `GET /api/sse/crawls/:id` | SSE stream for live logs |
| `GET /api/crawls/:id/download` | Download archive as zip |
| `GET /preview/:crawlId/*` | Preview archived files |
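The SSE endpoint emits standard `text/event-stream` frames, which a browser `EventSource` handles automatically. As a sketch of the wire format, here is a minimal parser for the `data:` payloads (the payload contents shown in the test are an assumption):

```typescript
// Extract `data:` payloads from a raw SSE chunk (illustrative; in the
// dashboard a browser EventSource does this parsing for you).
function parseSseData(chunk: string): string[] {
  return chunk
    .split("\n")
    .filter((line) => line.startsWith("data:"))
    .map((line) => line.slice("data:".length).trim());
}
```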
- Create a new project with PostgreSQL and Redis
- Deploy the API and worker as separate services
- Set environment variables
Build and run with Docker:

```
docker build -t dxd-scraper .
docker run -e DATABASE_URL=... -e REDIS_URL=... dxd-scraper
```

License: MIT