
Scalable Web Scraping Service

What does this server do?

This server is a high-performance, distributed web scraping and crawling platform designed for AI-powered data extraction at scale. It provides:

  • Concurrent scraping of multiple URLs using Playwright and FastAPI
  • Distributed job queueing with Celery and Redis/Upstash
  • Real-time job status tracking and resource monitoring
  • Dynamic rate limiting and backpressure to adapt to system load
  • API endpoints for submitting scrape jobs, checking status, and managing operations
  • Metrics and observability via Prometheus integration
  • Scalable deployment on Fly.io with multi-process and multi-worker support
  • Support for screenshots, PDFs, and JavaScript execution via API
  • Adaptive queue management and error tracking for robust operation

The server is suitable for large-scale, production-grade web crawling, data collection, and AI-driven content extraction tasks.

Overview

A high-performance, distributed web scraping service built with:

  • Python
  • FastAPI
  • Playwright
  • Upstash Redis
  • Fly.io Deployment

Features

  • Concurrent scraping of multiple URLs
  • Distributed job queueing
  • Scalable architecture
  • Real-time job status tracking

Prerequisites

  • Python 3.11+
  • Upstash Redis Account
  • Fly.io Account

Local Development Setup

  1. Clone the repository

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set the Upstash Redis environment variables:

    export UPSTASH_REDIS_REST_URL=your_redis_url
    export UPSTASH_REDIS_REST_TOKEN=your_redis_token

  4. Run the API server (a quick smoke test follows below):

    uvicorn server:app --reload
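
Once the server is running, you can verify it responds on the /health endpoint. A minimal sketch, assuming uvicorn's default bind address (http://127.0.0.1:8000) and a JSON response body:

    # Quick local smoke test; the address and response shape are assumptions.
    import requests

    resp = requests.get("http://127.0.0.1:8000/health", timeout=5)
    resp.raise_for_status()
    print(resp.json())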

Deployment

Deploy to Fly.io with:

    fly launch

API Endpoints

  • POST /scrape — Submit a scraping job with URL(s) and options (see the sketch after this list)
  • POST /llm/job — Submit an LLM extraction job
  • GET /user/data — Get user data (requires authentication)
  • POST /config/dump — Evaluate and dump config
  • GET /ws/events — WebSocket endpoint for real-time events
  • GET /llm/job/{task_id} — Get LLM job status/result
  • GET /crawl/job/{task_id} — Get crawl job status/result
  • POST /crawl/job/{task_id}/cancel — Cancel a running crawl job
  • GET /metrics — Prometheus metrics endpoint
  • GET /health — Health check endpoint
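
A typical workflow is to submit a job and then poll for its result. The sketch below assumes the locally running server from the setup above; the request payload ("urls", "options") and the response fields ("task_id", "status") are illustrative guesses rather than the documented schema, so check the interactive docs served by FastAPI for the real models.

    # Submit a scrape job, then poll the crawl-job endpoint until it settles.
    # Payload and response field names here are assumptions, not the documented schema.
    import time
    import requests

    BASE = "http://127.0.0.1:8000"

    job = requests.post(
        f"{BASE}/scrape",
        json={"urls": ["https://example.com"], "options": {"screenshot": False}},
        timeout=30,
    ).json()
    task_id = job["task_id"]  # assumed response field

    while True:
        status = requests.get(f"{BASE}/crawl/job/{task_id}", timeout=10).json()
        if status.get("status") in ("completed", "failed"):  # assumed status values
            break
        time.sleep(2)

    print(status)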

Configuration

Adjust concurrency and worker settings in worker.py
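
The exact settings in worker.py are not documented here, but a Celery-based worker typically exposes knobs along these lines; treat this as an illustrative sketch, not the project's actual configuration:

    # Hypothetical tuning sketch; the real worker.py may use different names and values.
    from celery import Celery

    app = Celery(
        "crawlagent",
        broker="rediss://:<password>@<host>:<port>",   # placeholder Redis URL
        backend="rediss://:<password>@<host>:<port>",  # placeholder Redis URL
    )

    app.conf.update(
        worker_concurrency=8,          # parallel task slots per worker process
        worker_prefetch_multiplier=1,  # pull one task at a time to avoid hoarding work
        task_acks_late=True,           # requeue tasks if a worker dies mid-scrape
    )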
