amanncodes/ScrapeOrchestrator


ScrapeOps — Scraper Orchestration Platform

A production-ready Django backend that acts as the control plane for all scraping infrastructure. It orchestrates scraping workflows across multiple execution layers (GraphQL → Hiker → DOM), tracks every retry and fallback, and provides full observability into pipeline performance.


Architecture

Client Request
      │
      ▼
Django Backend (Control Plane)
      ├── Redis      — deduplication cache + Celery broker
      ├── PostgreSQL — job state, events, metrics
      ├── MongoDB    — raw worker payloads
      │
      ▼
Celery Worker (dispatch_worker_task)
      │
      ▼
AWS Lambda Workers (Execution Plane)
      ├── GraphQL Lambda   — GraphQL scraping only
      ├── Hiker Lambda     — Hiker API scraping only
      └── DOM Lambda       — Browser-based scraping only
            │
            ▼
      Worker Response → Django Orchestration Engine
                              │
                        ┌─────┴──────┐
                     success?     failure?
                        │              │
                   complete      retry / fallback
                        │              │
                   callback      next layer

Fallback Chain

GraphQL  →(fail)→  Hiker  →(fail)→  DOM  →(fail)→  Job FAILED

Immediate fallback errors (no retry on same layer): rate_limit, blocked
Retry-eligible errors: timeout, parsing_error, unknown


Quick Start

1. Clone & configure

git clone <repo>
cd scrapeops
cp .env .env.local   # edit with your values

2. Start all services

docker compose up --build

This starts:

  • scrapeops_postgres on port 5433
  • scrapeops_redis on port 6380
  • scrapeops_mongo on port 27018
  • scrapeops_backend on port 8000 (auto-migrates on start)
  • scrapeops_worker — Celery worker (scraping + callbacks queues)
  • scrapeops_beat — Celery beat scheduler

3. Verify

curl http://localhost:8000/api/v1/workers/health/

Expected:

{
  "mock_mode": true,
  "layers": ["graphql", "hiker", "dom"],
  "lambda_arns": {"graphql": false, "hiker": false, "dom": false}
}

4. Browse API docs

Open: http://localhost:8000/api/docs/


API Reference

Jobs

Method Endpoint Description
POST /api/v1/jobs/submit/ Submit a new scraping job
GET /api/v1/jobs/ List all jobs
GET /api/v1/jobs/{id}/ Get job detail + event log
POST /api/v1/jobs/{id}/cancel/ Cancel a running job
GET /api/v1/jobs/{id}/events/ Get job event timeline
GET /api/v1/jobs/stats/ Aggregate job statistics

Submit a job

curl -X POST http://localhost:8000/api/v1/jobs/submit/ \
  -H "Content-Type: application/json" \
  -d '{
    "target_type": "instagram_post",
    "target_id": "CxYz123abc",
    "target_url": "https://www.instagram.com/p/CxYz123abc/",
    "comments_requested": 200,
    "max_retries": 2,
    "callback_url": "https://your-app.com/webhooks/scrapeops"
  }'

Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "current_layer": "graphql",
  "comments_requested": 200,
  "comments_fetched": 0,
  "retry_count": 0,
  "fallback_count": 0,
  "events": []
}

Job with full event trail (after completion)

{
  "id": "550e8400-...",
  "status": "success",
  "completed_layer": "hiker",
  "comments_fetched": 200,
  "retry_count": 1,
  "fallback_count": 1,
  "total_time_ms": 5840,
  "events": [
    {
      "layer": "graphql",
      "status": "failed",
      "error_type": "rate_limit",
      "comments_fetched": 0,
      "time_taken_ms": 320
    },
    {
      "layer": "hiker",
      "status": "success",
      "error_type": "none",
      "comments_fetched": 200,
      "time_taken_ms": 2800
    }
  ]
}
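After submitting, a client can poll the detail endpoint until the job reaches a terminal state. A minimal polling helper might look like this (a sketch; the exact set of terminal statuses is an assumption based on the statuses shown above):

```python
import time

def poll_until_done(fetch, job_id, interval=2.0,
                    terminal=("success", "failed", "cancelled")):
    """Poll fetch(job_id) -> job dict until the status is terminal."""
    while True:
        job = fetch(job_id)
        if job["status"] in terminal:
            return job
        time.sleep(interval)

# Usage against a live backend (hypothetical):
#   import requests
#   job = poll_until_done(
#       lambda i: requests.get(f"http://localhost:8000/api/v1/jobs/{i}/").json(),
#       job_id)
```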

Metrics

All endpoints accept ?hours=N (default: 24).

Endpoint Description
GET /api/v1/metrics/overview/ Success rate, avg time, comment throughput
GET /api/v1/metrics/layers/ Per-layer success rates and avg time
GET /api/v1/metrics/errors/ Error type frequency breakdown
GET /api/v1/metrics/throughput/ Comments fetched per hour
GET /api/v1/metrics/retries/ Retry and fallback analysis

Example: last 48 hours of layer performance

curl "http://localhost:8000/api/v1/metrics/layers/?hours=48"

Workers

Endpoint Description
GET /api/v1/workers/health/ Worker config and mock mode status
POST /api/v1/workers/redispatch/ Manually re-dispatch a stalled job

Callbacks

Endpoint Description
POST /api/v1/callbacks/worker/ Receive async Lambda result
POST /api/v1/callbacks/test/ Echo endpoint for testing delivery

Configuration

All configuration is via environment variables in .env:

Variable Default Description
DATABASE_URL postgres://... PostgreSQL connection string
REDIS_URL redis://... Redis connection string
MONGO_URL mongodb://... MongoDB connection string
MONGO_DB scrapeops MongoDB database name
AWS_ACCESS_KEY_ID (none) AWS credentials for Lambda
AWS_SECRET_ACCESS_KEY (none) AWS credentials for Lambda
AWS_REGION us-east-1 AWS region
GRAPHQL_LAMBDA_ARN (none) ARN for GraphQL Lambda
HIKER_LAMBDA_ARN (none) ARN for Hiker Lambda
DOM_LAMBDA_ARN (none) ARN for DOM Lambda
MAX_RETRIES_PER_LAYER 2 Max retries on a layer before fallback
DEDUP_CACHE_TTL 3600 Seconds to cache successful job dedup keys
JOB_TIMEOUT_SECONDS 300 Celery task soft timeout
CALLBACK_BASE_URL http://backend:8000 Internal base URL for Lambda callbacks
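The dedup cache governed by DEDUP_CACHE_TTL could work roughly as follows (a sketch; the key scheme and function names are assumptions, and the `set`/`exists` calls match the standard redis-py client):

```python
import hashlib

DEDUP_CACHE_TTL = 3600  # matches the default above

def dedup_key(target_type: str, target_id: str) -> str:
    """Stable Redis key for a scrape target (hypothetical scheme)."""
    raw = f"{target_type}:{target_id}"
    return "dedup:" + hashlib.sha256(raw.encode()).hexdigest()

def already_scraped(redis_client, target_type: str, target_id: str) -> bool:
    return redis_client.exists(dedup_key(target_type, target_id)) == 1

def mark_scraped(redis_client, target_type: str, target_id: str) -> None:
    # SET with EX gives the key a TTL, so the dedup entry expires after an hour.
    redis_client.set(dedup_key(target_type, target_id), "1", ex=DEDUP_CACHE_TTL)
```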

Mock Mode

When AWS credentials or Lambda ARNs are missing, the system runs in mock mode — simulated worker responses with realistic latency and failure rates:

  • GraphQL: ~40% rate_limit failure, 800–2000ms latency
  • Hiker: High success rate, 1500–3500ms latency
  • DOM: Fallback, always succeeds, 3000–6000ms latency

Mock mode is ideal for local development and testing the full orchestration loop without AWS.
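The mock responses could be generated roughly like this (a sketch; the real logic lives in apps/workers/invoker.py, and the Hiker failure probability and error type here are illustrative guesses):

```python
import random

# Approximate per-layer behavior described above; Hiker numbers are assumptions.
MOCK_PROFILES = {
    "graphql": {"fail_rate": 0.40, "error": "rate_limit", "latency_ms": (800, 2000)},
    "hiker":   {"fail_rate": 0.05, "error": "timeout",    "latency_ms": (1500, 3500)},
    "dom":     {"fail_rate": 0.00, "error": "none",       "latency_ms": (3000, 6000)},
}

def mock_worker_response(job_id: str, layer: str, comments_requested: int) -> dict:
    """Simulate a Lambda worker result matching the response contract."""
    profile = MOCK_PROFILES[layer]
    failed = random.random() < profile["fail_rate"]
    return {
        "job_id": job_id,
        "layer": layer,
        "success": not failed,
        "error_type": profile["error"] if failed else "none",
        "error_message": "simulated failure" if failed else "",
        "comments_fetched": 0 if failed else comments_requested,
        "total_comments_available": comments_requested,
        "time_taken_ms": random.randint(*profile["latency_ms"]),
    }
```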


Worker Response Contract

Every Lambda must return this exact structure:

{
  "job_id": "string (UUID)",
  "layer": "graphql | hiker | dom",
  "success": true,
  "error_type": "rate_limit | blocked | timeout | parsing_error | none",
  "error_message": "string",
  "comments_fetched": 120,
  "total_comments_available": 500,
  "time_taken_ms": 3420
}

Rules:

  • Workers must not perform retries or fallbacks
  • Workers must not call other layers
  • On failure, set success: false and populate error_type
  • time_taken_ms is the wall-clock time of the scrape attempt only
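A minimal single-layer handler honoring this contract might look like the sketch below. `scrape_comments` and `RateLimited` are placeholders for your actual scraping code, not part of this repo:

```python
import time

class RateLimited(Exception):
    """Raised by the scraper when the target rate-limits us (placeholder)."""

def scrape_comments(url: str, n: int):
    """Placeholder for real scraping: returns (comments, total_available)."""
    return ["..."] * min(n, 3), n

def handler(event, context):
    """AWS Lambda entry point for a single-layer worker (sketch)."""
    start = time.monotonic()  # wall-clock time of the scrape attempt only
    try:
        comments, total = scrape_comments(event["target_url"],
                                          event["comments_requested"])
        success, error_type, error_message = True, "none", ""
    except RateLimited as exc:
        comments, total = [], 0
        success, error_type, error_message = False, "rate_limit", str(exc)
    except Exception as exc:  # anything unclassified maps to "unknown"
        comments, total = [], 0
        success, error_type, error_message = False, "unknown", str(exc)
    return {
        "job_id": event["job_id"],
        "layer": event["layer"],
        "success": success,
        "error_type": error_type,
        "error_message": error_message,
        "comments_fetched": len(comments),
        "total_comments_available": total,
        "time_taken_ms": int((time.monotonic() - start) * 1000),
    }
```

Note the handler never retries or falls back itself; it only reports what happened and lets the orchestrator decide.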

Project Structure

scrapeops/
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── manage.py
├── .env
│
├── scrapeops/              # Django project
│   ├── settings.py
│   ├── urls.py
│   ├── celery.py
│   └── wsgi.py
│
├── apps/
│   ├── jobs/               # Core job lifecycle + orchestration engine
│   │   ├── models.py       — ScrapingJob, JobEvent
│   │   ├── orchestration.py — Decision engine: retry/fallback/complete
│   │   ├── serializers.py
│   │   ├── views.py
│   │   ├── admin.py
│   │   └── urls.py
│   │
│   ├── workers/            # Execution plane interface
│   │   ├── invoker.py      — Lambda caller + mock mode
│   │   ├── tasks.py        — Celery dispatch + async response handling
│   │   ├── views.py
│   │   └── urls.py
│   │
│   ├── metrics/            # Analytics and observability
│   │   ├── views.py        — Overview, layer breakdown, errors, throughput
│   │   └── urls.py
│   │
│   └── callbacks/          # Outbound job callbacks + async worker receiver
│       ├── tasks.py        — HTTP delivery with exponential backoff retry
│       ├── views.py
│       └── urls.py
│
└── config/
    └── mongo.py            # MongoDB connection singleton

Connecting Real Lambda Workers

  1. Deploy your Lambda with a single responsibility (GraphQL/Hiker/DOM only)
  2. Set the ARN in .env
  3. Lambda receives this payload:
    {
      "job_id": "uuid",
      "layer": "graphql",
      "target_id": "instagram_post_id",
      "target_url": "https://...",
      "comments_requested": 150,
      "callback_url": "http://your-backend/api/v1/callbacks/worker/"
    }
  4. Lambda returns the worker response contract (see above)
  5. For async invocation, POST the result to callback_url instead of returning it
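For the async path in step 5, the Lambda could deliver its result with nothing but the standard library (a sketch using urllib, so no extra packages need bundling; the helper names are hypothetical):

```python
import json
import urllib.request

def build_callback_request(callback_url: str, result: dict) -> urllib.request.Request:
    """Build the POST that delivers the worker response contract."""
    return urllib.request.Request(
        callback_url,
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_result(callback_url: str, result: dict, timeout: float = 10.0) -> int:
    """POST the result to the orchestrator and return the HTTP status code."""
    req = build_callback_request(callback_url, result)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```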

Monitoring

Recommended stack:

  • OpenTelemetry — traces and spans across job lifecycle (SDK included in requirements)
  • Grafana — dashboards from PostgreSQL metrics
  • AWS CloudWatch — Lambda execution logs and error rates

Key metrics to alert on:

  • success_rate_pct < 80% over any 1h window
  • avg_job_time_ms > 15000
  • jobs_with_fallbacks / total_jobs > 50%
  • Celery queue depth > 500
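The thresholds above can be expressed as one check over a metrics snapshot (a sketch; the dict keys here are assumptions loosely mirroring the metric names, and queue depth would come from Celery rather than the metrics API):

```python
def alerts(metrics: dict) -> list[str]:
    """Return a message for each alert threshold listed above that fires."""
    fired = []
    if metrics["success_rate_pct"] < 80:
        fired.append("success rate below 80%")
    if metrics["avg_job_time_ms"] > 15000:
        fired.append("average job time above 15s")
    if metrics["total_jobs"] and metrics["jobs_with_fallbacks"] / metrics["total_jobs"] > 0.5:
        fired.append("more than half of jobs fell back")
    if metrics["celery_queue_depth"] > 500:
        fired.append("Celery queue depth above 500")
    return fired
```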

Django Admin

Available at http://localhost:8000/admin/

Create a superuser:

docker compose exec backend python manage.py createsuperuser

The admin provides full job and event visibility, with inline event history per job.
