DXD Webflow Site Scraper

A generalized Webflow site archiver with a UI dashboard. Crawls any Webflow site and produces a static, self-contained snapshot.

Features

Multi-site support: Manage multiple Webflow sites
Real-time progress: Watch crawls progress live over Server-Sent Events (SSE)
Scheduled crawls: Set up cron-based schedules for automatic archiving
Asset localization: Downloads JS, CSS, images, and fonts, and rewrites references to point at the local copies
Webflow-specific: Removes badges, normalizes lazy-loaded media
Download & Preview: Download archives as zip or preview in-browser

Architecture

dxd-webflow-scraper/
├── apps/
│   ├── web/          # Vite + TanStack Router dashboard
│   └── api/          # Hono API server
├── packages/
│   └── scraper/      # Core scraping logic
└── services/
    └── worker/       # Background job processor

Current architecture reference: ARCHITECTURE_CURRENT.md
Historical design notes: ARCHITECTURE.md

Tech Stack

Frontend: Vite, React, TanStack Router, TanStack Query, Tailwind CSS
Backend: Hono, Drizzle ORM, PostgreSQL
Worker: BullMQ, Redis, Playwright
Scraper: Playwright, Cheerio

Getting Started

Prerequisites

Bun >= 1.3.6
Docker (for PostgreSQL and Redis)
Playwright browsers

Setup

Clone and install dependencies:

bun install

Start PostgreSQL and Redis:

docker-compose up -d

Set up environment variables:

cp .env.example .env
# Edit .env with your settings

Push database schema:

cd apps/api && bun run db:push

Install Playwright browsers:

bunx playwright install chromium

Configure GitHub OAuth (required for dashboard login):

Create a GitHub OAuth App in your org/account settings.
Set Authorization callback URL to:
- Local: http://localhost:3001/api/auth/callback/github
- Production: https://<your-api-domain>/api/auth/callback/github
Add these env vars for the API service:
- GITHUB_CLIENT_ID
- GITHUB_CLIENT_SECRET
- AUTH_SECRET (random high-entropy string)
- FRONTEND_URL (for redirects, e.g. http://localhost:5173 or your web app domain)
Add this env var for the web app:
- VITE_API_URL (e.g. http://localhost:3001 locally)
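
The API variables listed above could be checked once at startup so that a missing value fails fast instead of surfacing as a broken login. This is a minimal sketch, not the project's actual code: the `loadAuthEnv` helper and `AuthEnv` type are hypothetical names, and the 32-character minimum for AUTH_SECRET is an assumption.

```typescript
// Hypothetical startup check for the API's auth-related environment variables.
// Helper name, type name, and the 32-character minimum are assumptions.
type AuthEnv = {
  GITHUB_CLIENT_ID: string;
  GITHUB_CLIENT_SECRET: string;
  AUTH_SECRET: string;
  FRONTEND_URL: string;
};

const REQUIRED_KEYS = [
  "GITHUB_CLIENT_ID",
  "GITHUB_CLIENT_SECRET",
  "AUTH_SECRET",
  "FRONTEND_URL",
] as const;

function loadAuthEnv(env: Record<string, string | undefined>): AuthEnv {
  // Collect every missing key so the error names all of them at once.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const get = (key: (typeof REQUIRED_KEYS)[number]): string => env[key] as string;

  if (get("AUTH_SECRET").length < 32) {
    throw new Error("AUTH_SECRET should be at least 32 characters");
  }
  // The URL constructor throws a TypeError on malformed input.
  new URL(get("FRONTEND_URL"));

  return {
    GITHUB_CLIENT_ID: get("GITHUB_CLIENT_ID"),
    GITHUB_CLIENT_SECRET: get("GITHUB_CLIENT_SECRET"),
    AUTH_SECRET: get("AUTH_SECRET"),
    FRONTEND_URL: get("FRONTEND_URL"),
  };
}
```

Under Bun, calling `loadAuthEnv(process.env)` before the server starts listening would be one way to use it.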

Development

Run all services in parallel:

# Terminal 1: API server
cd apps/api && bun run dev

# Terminal 2: Web dashboard
cd apps/web && bun run dev

# Terminal 3: Background worker
cd services/worker && bun run dev

Or start everything at once with Turbo from the repository root:

bun run dev

Open http://localhost:5173 for the dashboard.

Usage

Add a site: Go to Sites → Add Site, enter a Webflow URL
Start a crawl: Click "Start Crawl" on any site
Monitor progress: Watch real-time logs on the crawl detail page
Download: Once complete, download the archive or preview in-browser
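
For the "Monitor progress" step, the dashboard receives logs as Server-Sent Events. In a browser, the built-in `EventSource` handles the wire format; for scripts that read the stream themselves, the sketch below parses SSE text into messages. The `SseMessage` type and `parseSse` function are hypothetical names, the payload format of this project's events is not documented here, and the parser only handles the `event` and `data` fields of complete events (a streaming consumer would need to buffer partial ones).

```typescript
// Minimal parser for Server-Sent Events text.
// Handles only the `event` and `data` fields and assumes complete events.
type SseMessage = { event: string; data: string };

function parseSse(text: string): SseMessage[] {
  const messages: SseMessage[] = [];
  // Events are separated by a blank line.
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = "message"; // default event type in the SSE format
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith(":")) continue; // comment line, often a keep-alive
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
      if (field === "event") event = value;
      else if (field === "data") dataLines.push(value);
    }
    // Blocks without any data field produce no message.
    if (dataLines.length > 0) {
      messages.push({ event, data: dataLines.join("\n") });
    }
  }
  return messages;
}
```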

Configuration

Site Settings

Concurrency: Number of pages to crawl in parallel (1-30)
Max Pages: Limit total pages (useful for testing)
Exclude Patterns: Regex patterns to skip certain URLs
Remove Webflow Badge: Strip the Webflow attribution badge
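
The settings above can be pictured as a small typed object with a validation step. The sketch below is illustrative only: the `SiteSettings` field names and both helper functions are assumptions, not the project's actual schema. Only the 1-30 concurrency range and the use of regex exclude patterns come from the list above.

```typescript
// Hypothetical shape for per-site settings; field names are assumptions.
interface SiteSettings {
  concurrency: number;         // pages crawled in parallel, 1-30
  maxPages?: number;           // optional cap on total pages
  excludePatterns: string[];   // regex sources; matching URLs are skipped
  removeWebflowBadge: boolean; // strip the Webflow attribution badge
}

// Returns a list of human-readable problems; an empty list means valid.
function validateSiteSettings(settings: SiteSettings): string[] {
  const errors: string[] = [];
  const { concurrency, maxPages, excludePatterns } = settings;
  if (!Number.isInteger(concurrency) || concurrency < 1 || concurrency > 30) {
    errors.push("concurrency must be an integer between 1 and 30");
  }
  if (maxPages !== undefined && (!Number.isInteger(maxPages) || maxPages < 1)) {
    errors.push("maxPages must be a positive integer");
  }
  for (const pattern of excludePatterns) {
    try {
      new RegExp(pattern); // throws SyntaxError on an invalid pattern
    } catch {
      errors.push(`invalid exclude pattern: ${pattern}`);
    }
  }
  return errors;
}

// True when any exclude pattern matches the URL.
function isExcluded(url: string, excludePatterns: string[]): boolean {
  return excludePatterns.some((pattern) => new RegExp(pattern).test(url));
}
```

For example, an exclude pattern of `/blog/` would skip every URL containing that path segment, while `^https://[^/]+/blog/` would only skip URLs whose path starts with it.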

Storage Options

Local: Store archives on the server filesystem
S3/R2: Store in S3-compatible storage (Cloudflare R2, AWS S3, etc.)

API

Endpoint	Description
`GET /api/sites`	List all sites
`POST /api/sites`	Create a site
`POST /api/sites/:id/crawl`	Start a crawl
`GET /api/crawls`	List crawls
`GET /api/crawls/:id`	Get crawl details
`GET /api/sse/crawls/:id`	SSE stream for live logs
`GET /api/crawls/:id/download`	Download archive as zip
`GET /preview/:crawlId/*`	Preview archived files
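
A script could drive these endpoints with `fetch`. The sketch below only builds URLs from the paths in the table and shows one request; the `routes`, `apiUrl`, and `startCrawl` names are hypothetical, and it assumes that the crawl endpoint needs no request body, responds with JSON, and authenticates via a session cookie. None of these details are confirmed by this README.

```typescript
// Path builders for the endpoints listed in the table above.
const routes = {
  sites: () => "/api/sites",
  startCrawl: (siteId: string) => `/api/sites/${encodeURIComponent(siteId)}/crawl`,
  crawls: () => "/api/crawls",
  crawl: (crawlId: string) => `/api/crawls/${encodeURIComponent(crawlId)}`,
  crawlEvents: (crawlId: string) => `/api/sse/crawls/${encodeURIComponent(crawlId)}`,
  download: (crawlId: string) => `/api/crawls/${encodeURIComponent(crawlId)}/download`,
};

// Resolves an API path against a base such as http://localhost:3001.
function apiUrl(base: string, path: string): string {
  return new URL(path, base).toString();
}

// Starts a crawl for a site.
// Assumptions: no request body, JSON response, cookie-based session.
async function startCrawl(base: string, siteId: string): Promise<unknown> {
  const res = await fetch(apiUrl(base, routes.startCrawl(siteId)), {
    method: "POST",
    credentials: "include",
  });
  if (!res.ok) {
    throw new Error(`Start crawl failed with status ${res.status}`);
  }
  return res.json();
}
```

Keeping the path builders separate from the request code makes the URLs easy to check without a running server.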

Deployment

Railway

Create a new project with PostgreSQL and Redis
Deploy the API and worker as separate services
Set environment variables

Docker

Build and run with Docker:

docker build -t dxd-scraper .
docker run -e DATABASE_URL=... -e REDIS_URL=... dxd-scraper

License

MIT
