amanncodes/ScrapeOrchestrator


ScrapeOps — Scraper Orchestration Platform

A production-ready Django backend that acts as the control plane for all scraping infrastructure. It orchestrates scraping workflows across multiple execution layers (GraphQL → Hiker → DOM), tracks every retry and fallback, and provides full observability into pipeline performance.


Architecture

Client Request
      │
      ▼
Django Backend (Control Plane)
      ├── Redis      — deduplication cache + Celery broker
      ├── PostgreSQL — job state, events, metrics
      ├── MongoDB    — raw worker payloads
      │
      ▼
Celery Worker (dispatch_worker_task)
      │
      ▼
AWS Lambda Workers (Execution Plane)
      ├── GraphQL Lambda   — GraphQL scraping only
      ├── Hiker Lambda     — Hiker API scraping only
      └── DOM Lambda       — Browser-based scraping only
            │
            ▼
      Worker Response → Django Orchestration Engine
                              │
                        ┌─────┴──────┐
                     success?     failure?
                        │              │
                   complete      retry / fallback
                        │              │
                   callback      next layer

Fallback Chain

GraphQL  →(fail)→  Hiker  →(fail)→  DOM  →(fail)→  Job FAILED

Immediate fallback errors (no retry on same layer): rate_limit, blocked
Retry-eligible errors: timeout, parsing_error, unknown


Quick Start

1. Clone & configure

git clone <repo>
cd scrapeops
cp .env .env.local   # edit with your values

2. Start all services

docker compose up --build

This starts:

  • scrapeops_postgres on port 5433
  • scrapeops_redis on port 6380
  • scrapeops_mongo on port 27018
  • scrapeops_backend on port 8000 (auto-migrates on start)
  • scrapeops_worker — Celery worker (scraping + callbacks queues)
  • scrapeops_beat — Celery beat scheduler

3. Verify

curl http://localhost:8000/api/v1/workers/health/

Expected:

{
  "mock_mode": true,
  "layers": ["graphql", "hiker", "dom"],
  "lambda_arns": {"graphql": false, "hiker": false, "dom": false}
}

4. Browse API docs

Open: http://localhost:8000/api/docs/


API Reference

Jobs

Method Endpoint Description
POST /api/v1/jobs/submit/ Submit a new scraping job
GET /api/v1/jobs/ List all jobs
GET /api/v1/jobs/{id}/ Get job detail + event log
POST /api/v1/jobs/{id}/cancel/ Cancel a running job
GET /api/v1/jobs/{id}/events/ Get job event timeline
GET /api/v1/jobs/stats/ Aggregate job statistics

Submit a job

curl -X POST http://localhost:8000/api/v1/jobs/submit/ \
  -H "Content-Type: application/json" \
  -d '{
    "target_type": "instagram_post",
    "target_id": "CxYz123abc",
    "target_url": "https://www.instagram.com/p/CxYz123abc/",
    "comments_requested": 200,
    "max_retries": 2,
    "callback_url": "https://your-app.com/webhooks/scrapeops"
  }'

Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "current_layer": "graphql",
  "comments_requested": 200,
  "comments_fetched": 0,
  "retry_count": 0,
  "fallback_count": 0,
  "events": []
}

Job with full event trail (after completion)

{
  "id": "550e8400-...",
  "status": "success",
  "completed_layer": "hiker",
  "comments_fetched": 200,
  "retry_count": 1,
  "fallback_count": 1,
  "total_time_ms": 5840,
  "events": [
    {
      "layer": "graphql",
      "status": "failed",
      "error_type": "rate_limit",
      "comments_fetched": 0,
      "time_taken_ms": 320
    },
    {
      "layer": "hiker",
      "status": "success",
      "error_type": "none",
      "comments_fetched": 200,
      "time_taken_ms": 2800
    }
  ]
}
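After submitting, a client can poll the detail endpoint until the job reaches a terminal state. A minimal polling helper might look like this (a sketch; the exact set of terminal statuses is an assumption based on the statuses shown above):

```python
import time

def poll_until_done(fetch, job_id, interval=2.0,
                    terminal=("success", "failed", "cancelled")):
    """Poll fetch(job_id) -> job dict until the status is terminal."""
    while True:
        job = fetch(job_id)
        if job["status"] in terminal:
            return job
        time.sleep(interval)

# Usage against a live backend (hypothetical):
#   import requests
#   job = poll_until_done(
#       lambda i: requests.get(f"http://localhost:8000/api/v1/jobs/{i}/").json(),
#       job_id)
```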

Metrics

All endpoints accept ?hours=N (default: 24).

Endpoint Description
GET /api/v1/metrics/overview/ Success rate, avg time, comment throughput
GET /api/v1/metrics/layers/ Per-layer success rates and avg time
GET /api/v1/metrics/errors/ Error type frequency breakdown
GET /api/v1/metrics/throughput/ Comments fetched per hour
GET /api/v1/metrics/retries/ Retry and fallback analysis

Example: last 48 hours of layer performance

curl "http://localhost:8000/api/v1/metrics/layers/?hours=48"

Workers

Endpoint Description
GET /api/v1/workers/health/ Worker config and mock mode status
POST /api/v1/workers/redispatch/ Manually re-dispatch a stalled job

Callbacks

Endpoint Description
POST /api/v1/callbacks/worker/ Receive async Lambda result
POST /api/v1/callbacks/test/ Echo endpoint for testing delivery

Configuration

All configuration is via environment variables in .env:

Variable Default Description
DATABASE_URL postgres://... PostgreSQL connection string
REDIS_URL redis://... Redis connection string
MONGO_URL mongodb://... MongoDB connection string
MONGO_DB scrapeops MongoDB database name
AWS_ACCESS_KEY_ID (none) AWS credentials for Lambda
AWS_SECRET_ACCESS_KEY (none) AWS credentials for Lambda
AWS_REGION us-east-1 AWS region
GRAPHQL_LAMBDA_ARN (none) ARN for GraphQL Lambda
HIKER_LAMBDA_ARN (none) ARN for Hiker Lambda
DOM_LAMBDA_ARN (none) ARN for DOM Lambda
MAX_RETRIES_PER_LAYER 2 Max retries on a layer before fallback
DEDUP_CACHE_TTL 3600 Seconds to cache successful job dedup keys
JOB_TIMEOUT_SECONDS 300 Celery task soft timeout
CALLBACK_BASE_URL http://backend:8000 Internal base URL for Lambda callbacks
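The dedup cache governed by DEDUP_CACHE_TTL could work roughly as follows (a sketch; the key scheme and function names are assumptions, and the `set`/`exists` calls match the standard redis-py client):

```python
import hashlib

DEDUP_CACHE_TTL = 3600  # matches the default above

def dedup_key(target_type: str, target_id: str) -> str:
    """Stable Redis key for a scrape target (hypothetical scheme)."""
    raw = f"{target_type}:{target_id}"
    return "dedup:" + hashlib.sha256(raw.encode()).hexdigest()

def already_scraped(redis_client, target_type: str, target_id: str) -> bool:
    return redis_client.exists(dedup_key(target_type, target_id)) == 1

def mark_scraped(redis_client, target_type: str, target_id: str) -> None:
    # SET with EX gives the key a TTL, so the dedup entry expires after an hour.
    redis_client.set(dedup_key(target_type, target_id), "1", ex=DEDUP_CACHE_TTL)
```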

Mock Mode

When AWS credentials or Lambda ARNs are missing, the system runs in mock mode — simulated worker responses with realistic latency and failure rates:

  • GraphQL: ~40% rate_limit failure, 800–2000ms latency
  • Hiker: High success rate, 1500–3500ms latency
  • DOM: Fallback, always succeeds, 3000–6000ms latency

Mock mode is ideal for local development and testing the full orchestration loop without AWS.
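The mock responses could be generated roughly like this (a sketch; the real logic lives in apps/workers/invoker.py, and the Hiker failure probability and error type here are illustrative guesses):

```python
import random

# Approximate per-layer behavior described above; Hiker numbers are assumptions.
MOCK_PROFILES = {
    "graphql": {"fail_rate": 0.40, "error": "rate_limit", "latency_ms": (800, 2000)},
    "hiker":   {"fail_rate": 0.05, "error": "timeout",    "latency_ms": (1500, 3500)},
    "dom":     {"fail_rate": 0.00, "error": "none",       "latency_ms": (3000, 6000)},
}

def mock_worker_response(job_id: str, layer: str, comments_requested: int) -> dict:
    """Simulate a Lambda worker result matching the response contract."""
    profile = MOCK_PROFILES[layer]
    failed = random.random() < profile["fail_rate"]
    return {
        "job_id": job_id,
        "layer": layer,
        "success": not failed,
        "error_type": profile["error"] if failed else "none",
        "error_message": "simulated failure" if failed else "",
        "comments_fetched": 0 if failed else comments_requested,
        "total_comments_available": comments_requested,
        "time_taken_ms": random.randint(*profile["latency_ms"]),
    }
```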


Worker Response Contract

Every Lambda must return this exact structure:

{
  "job_id": "string (UUID)",
  "layer": "graphql | hiker | dom",
  "success": true,
  "error_type": "rate_limit | blocked | timeout | parsing_error | none",
  "error_message": "string",
  "comments_fetched": 120,
  "total_comments_available": 500,
  "time_taken_ms": 3420
}

Rules:

  • Workers must not perform retries or fallbacks
  • Workers must not call other layers
  • On failure, set success: false and populate error_type
  • time_taken_ms is the wall-clock time of the scrape attempt only
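A minimal single-layer handler honoring this contract might look like the sketch below. `scrape_comments` and `RateLimited` are placeholders for your actual scraping code, not part of this repo:

```python
import time

class RateLimited(Exception):
    """Raised by the scraper when the target rate-limits us (placeholder)."""

def scrape_comments(url: str, n: int):
    """Placeholder for real scraping: returns (comments, total_available)."""
    return ["..."] * min(n, 3), n

def handler(event, context):
    """AWS Lambda entry point for a single-layer worker (sketch)."""
    start = time.monotonic()  # wall-clock time of the scrape attempt only
    try:
        comments, total = scrape_comments(event["target_url"],
                                          event["comments_requested"])
        success, error_type, error_message = True, "none", ""
    except RateLimited as exc:
        comments, total = [], 0
        success, error_type, error_message = False, "rate_limit", str(exc)
    except Exception as exc:  # anything unclassified maps to "unknown"
        comments, total = [], 0
        success, error_type, error_message = False, "unknown", str(exc)
    return {
        "job_id": event["job_id"],
        "layer": event["layer"],
        "success": success,
        "error_type": error_type,
        "error_message": error_message,
        "comments_fetched": len(comments),
        "total_comments_available": total,
        "time_taken_ms": int((time.monotonic() - start) * 1000),
    }
```

Note the handler never retries or falls back itself; it only reports what happened and lets the orchestrator decide.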

Project Structure

scrapeops/
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── manage.py
├── .env
│
├── scrapeops/              # Django project
│   ├── settings.py
│   ├── urls.py
│   ├── celery.py
│   └── wsgi.py
│
├── apps/
│   ├── jobs/               # Core job lifecycle + orchestration engine
│   │   ├── models.py       — ScrapingJob, JobEvent
│   │   ├── orchestration.py — Decision engine: retry/fallback/complete
│   │   ├── serializers.py
│   │   ├── views.py
│   │   ├── admin.py
│   │   └── urls.py
│   │
│   ├── workers/            # Execution plane interface
│   │   ├── invoker.py      — Lambda caller + mock mode
│   │   ├── tasks.py        — Celery dispatch + async response handling
│   │   ├── views.py
│   │   └── urls.py
│   │
│   ├── metrics/            # Analytics and observability
│   │   ├── views.py        — Overview, layer breakdown, errors, throughput
│   │   └── urls.py
│   │
│   └── callbacks/          # Outbound job callbacks + async worker receiver
│       ├── tasks.py        — HTTP delivery with exponential backoff retry
│       ├── views.py
│       └── urls.py
│
└── config/
    └── mongo.py            # MongoDB connection singleton

Connecting Real Lambda Workers

  1. Deploy your Lambda with a single responsibility (GraphQL/Hiker/DOM only)
  2. Set the ARN in .env
  3. Lambda receives this payload:
    {
      "job_id": "uuid",
      "layer": "graphql",
      "target_id": "instagram_post_id",
      "target_url": "https://...",
      "comments_requested": 150,
      "callback_url": "http://your-backend/api/v1/callbacks/worker/"
    }
  4. Lambda returns the worker response contract (see above)
  5. For async invocation, POST the result to callback_url instead of returning it
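For the async path in step 5, the Lambda could deliver its result with nothing but the standard library (a sketch using urllib, so no extra packages need bundling; the helper names are hypothetical):

```python
import json
import urllib.request

def build_callback_request(callback_url: str, result: dict) -> urllib.request.Request:
    """Build the POST that delivers the worker response contract."""
    return urllib.request.Request(
        callback_url,
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_result(callback_url: str, result: dict, timeout: float = 10.0) -> int:
    """POST the result to the orchestrator and return the HTTP status code."""
    req = build_callback_request(callback_url, result)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```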

Monitoring

Recommended stack:

  • OpenTelemetry — traces and spans across job lifecycle (SDK included in requirements)
  • Grafana — dashboards from PostgreSQL metrics
  • AWS CloudWatch — Lambda execution logs and error rates

Key metrics to alert on:

  • success_rate_pct < 80% over any 1h window
  • avg_job_time_ms > 15000
  • jobs_with_fallbacks / total_jobs > 50%
  • Celery queue depth > 500
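The thresholds above can be expressed as one check over a metrics snapshot (a sketch; the dict keys here are assumptions loosely mirroring the metric names, and queue depth would come from Celery rather than the metrics API):

```python
def alerts(metrics: dict) -> list[str]:
    """Return a message for each alert threshold listed above that fires."""
    fired = []
    if metrics["success_rate_pct"] < 80:
        fired.append("success rate below 80%")
    if metrics["avg_job_time_ms"] > 15000:
        fired.append("average job time above 15s")
    if metrics["total_jobs"] and metrics["jobs_with_fallbacks"] / metrics["total_jobs"] > 0.5:
        fired.append("more than half of jobs fell back")
    if metrics["celery_queue_depth"] > 500:
        fired.append("Celery queue depth above 500")
    return fired
```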

Django Admin

Available at http://localhost:8000/admin/

Create a superuser:

docker compose exec backend python manage.py createsuperuser

The admin provides full job and event visibility, with inline event history per job.
