A production-ready Django backend that acts as the control plane for all scraping infrastructure. It orchestrates scraping workflows across multiple execution layers (GraphQL → Hiker → DOM), tracks every retry and fallback, and provides full observability into pipeline performance.
```
Client Request
      │
      ▼
Django Backend (Control Plane)
 ├── Redis — deduplication cache + Celery broker
 ├── PostgreSQL — job state, events, metrics
 ├── MongoDB — raw worker payloads
      │
      ▼
Celery Worker (dispatch_worker_task)
      │
      ▼
AWS Lambda Workers (Execution Plane)
 ├── GraphQL Lambda — GraphQL scraping only
 ├── Hiker Lambda — Hiker API scraping only
 └── DOM Lambda — browser-based scraping only
      │
      ▼
Worker Response → Django Orchestration Engine
      │
 ┌────┴─────┐
success?  failure?
 │           │
complete   retry / fallback
 │           │
callback   next layer
```

Fallback chain:

```
GraphQL →(fail)→ Hiker →(fail)→ DOM →(fail)→ Job FAILED
```
- Immediate fallback errors (no retry on same layer): `rate_limit`, `blocked`
- Retry-eligible errors: `timeout`, `parsing_error`, `unknown`
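The retry/fallback rules above can be sketched as a pure decision function. This is a simplified illustration, not the actual `orchestration.py` implementation; the layer order and error classes follow the lists above:

```python
# Sketch of the retry/fallback decision. Illustrative only -- the real
# engine lives in apps/jobs/orchestration.py.

LAYER_ORDER = ["graphql", "hiker", "dom"]
IMMEDIATE_FALLBACK = {"rate_limit", "blocked"}        # never retried on the same layer
RETRY_ELIGIBLE = {"timeout", "parsing_error", "unknown"}

def next_action(layer: str, error_type: str, retry_count: int, max_retries: int = 2):
    """Return ("retry", layer), ("fallback", next_layer), or ("fail", None)."""
    if error_type in RETRY_ELIGIBLE and retry_count < max_retries:
        return ("retry", layer)
    # rate_limit/blocked, or retries exhausted: move down the chain
    idx = LAYER_ORDER.index(layer)
    if idx + 1 < len(LAYER_ORDER):
        return ("fallback", LAYER_ORDER[idx + 1])
    return ("fail", None)
```

A `rate_limit` on GraphQL therefore falls straight through to Hiker, while a `timeout` is retried on the same layer until `MAX_RETRIES_PER_LAYER` is exhausted.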
```bash
git clone <repo>
cd scrapeops
cp .env .env.local   # edit with your values
docker compose up --build
```

This starts:

- `scrapeops_postgres` on port 5433
- `scrapeops_redis` on port 6380
- `scrapeops_mongo` on port 27018
- `scrapeops_backend` on port 8000 (auto-migrates on start)
- `scrapeops_worker` — Celery worker (`scraping` + `callbacks` queues)
- `scrapeops_beat` — Celery beat scheduler
```bash
curl http://localhost:8000/api/v1/workers/health/
```

Expected:

```json
{
  "mock_mode": true,
  "layers": ["graphql", "hiker", "dom"],
  "lambda_arns": {"graphql": false, "hiker": false, "dom": false}
}
```

Open: http://localhost:8000/api/docs/
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/jobs/submit/` | Submit a new scraping job |
| GET | `/api/v1/jobs/` | List all jobs |
| GET | `/api/v1/jobs/{id}/` | Get job detail + event log |
| POST | `/api/v1/jobs/{id}/cancel/` | Cancel a running job |
| GET | `/api/v1/jobs/{id}/events/` | Get job event timeline |
| GET | `/api/v1/jobs/stats/` | Aggregate job statistics |
```bash
curl -X POST http://localhost:8000/api/v1/jobs/submit/ \
  -H "Content-Type: application/json" \
  -d '{
    "target_type": "instagram_post",
    "target_id": "CxYz123abc",
    "target_url": "https://www.instagram.com/p/CxYz123abc/",
    "comments_requested": 200,
    "max_retries": 2,
    "callback_url": "https://your-app.com/webhooks/scrapeops"
  }'
```

Response:

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "current_layer": "graphql",
  "comments_requested": 200,
  "comments_fetched": 0,
  "retry_count": 0,
  "fallback_count": 0,
  "events": []
}
```

A completed job detail (here: GraphQL failed, Hiker succeeded after a fallback) looks like:

```json
{
  "id": "550e8400-...",
  "status": "success",
  "completed_layer": "hiker",
  "comments_fetched": 200,
  "retry_count": 1,
  "fallback_count": 1,
  "total_time_ms": 5840,
  "events": [
    {
      "layer": "graphql",
      "status": "failed",
      "error_type": "rate_limit",
      "comments_fetched": 0,
      "time_taken_ms": 320
    },
    {
      "layer": "hiker",
      "status": "success",
      "error_type": "none",
      "comments_fetched": 200,
      "time_taken_ms": 2800
    }
  ]
}
```

All endpoints accept `?hours=N` (default: 24).
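Submit-then-poll is the typical client pattern. A minimal sketch using only the stdlib, assuming the API runs at `localhost:8000` and treating `pending`/`running` as the non-terminal statuses (the `pending` status is an assumption; the submit response above shows `running`):

```python
import json
import time
from urllib import request

BASE = "http://localhost:8000/api/v1"  # assumed local deployment

def submit_job(payload: dict, opener=request.urlopen) -> dict:
    """POST the job payload to /jobs/submit/ and return the parsed response."""
    req = request.Request(
        f"{BASE}/jobs/submit/",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(req) as resp:
        return json.load(resp)

def wait_for_job(job_id: str, fetch, poll_s: float = 2.0, timeout_s: float = 300.0) -> dict:
    """Poll /jobs/{id}/ via `fetch` until the job leaves a non-terminal status."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = fetch(f"{BASE}/jobs/{job_id}/")
        if job["status"] not in ("pending", "running"):
            return job
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} still running after {timeout_s}s")
```

`fetch` is injected so the poll loop can be tested without a network; in practice it would be a small wrapper around `urllib.request.urlopen` that parses the JSON body.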
| Endpoint | Description |
|---|---|
| `GET /api/v1/metrics/overview/` | Success rate, avg time, comment throughput |
| `GET /api/v1/metrics/layers/` | Per-layer success rates and avg time |
| `GET /api/v1/metrics/errors/` | Error type frequency breakdown |
| `GET /api/v1/metrics/throughput/` | Comments fetched per hour |
| `GET /api/v1/metrics/retries/` | Retry and fallback analysis |
```bash
# Last 48 hours of layer performance
curl "http://localhost:8000/api/v1/metrics/layers/?hours=48"
```

| Endpoint | Description |
|---|---|
| `GET /api/v1/workers/health/` | Worker config and mock mode status |
| `POST /api/v1/workers/redispatch/` | Manually re-dispatch a stalled job |
| Endpoint | Description |
|---|---|
| `POST /api/v1/callbacks/worker/` | Receive async Lambda result |
| `POST /api/v1/callbacks/test/` | Echo endpoint for testing delivery |
All configuration is via environment variables in `.env`:

| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `postgres://...` | PostgreSQL connection string |
| `REDIS_URL` | `redis://...` | Redis connection string |
| `MONGO_URL` | `mongodb://...` | MongoDB connection string |
| `MONGO_DB` | `scrapeops` | MongoDB database name |
| `AWS_ACCESS_KEY_ID` | — | AWS credentials for Lambda |
| `AWS_SECRET_ACCESS_KEY` | — | AWS credentials for Lambda |
| `AWS_REGION` | `us-east-1` | AWS region |
| `GRAPHQL_LAMBDA_ARN` | — | ARN for GraphQL Lambda |
| `HIKER_LAMBDA_ARN` | — | ARN for Hiker Lambda |
| `DOM_LAMBDA_ARN` | — | ARN for DOM Lambda |
| `MAX_RETRIES_PER_LAYER` | `2` | Max retries before fallback |
| `DEDUP_CACHE_TTL` | `3600` | Seconds to cache successful job dedup keys |
| `JOB_TIMEOUT_SECONDS` | `300` | Celery task soft timeout |
| `CALLBACK_BASE_URL` | `http://backend:8000` | Internal base URL for Lambda callbacks |
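To illustrate how `DEDUP_CACHE_TTL` is used: the Redis dedup cache needs a stable key per logical job. The key scheme below is an assumption for illustration only; the real scheme lives in the backend:

```python
import hashlib

DEDUP_CACHE_TTL = 3600  # from .env; seconds to keep a successful job's key

def dedup_key(target_type: str, target_id: str, comments_requested: int) -> str:
    """Derive a stable Redis key for a job. Key scheme is illustrative only."""
    raw = f"{target_type}:{target_id}:{comments_requested}"
    digest = hashlib.sha256(raw.encode()).hexdigest()[:16]
    return f"scrapeops:dedup:{digest}"

# On submit, the backend would do roughly (pseudocode, redis client assumed):
#   if redis.set(dedup_key(...), job_id, nx=True, ex=DEDUP_CACHE_TTL) is None:
#       a job with the same key succeeded recently -> return it instead
#       of dispatching a duplicate
```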
When AWS credentials or Lambda ARNs are missing, the system runs in mock mode — simulated worker responses with realistic latency and failure rates:
- GraphQL: ~40% `rate_limit` failure, 800–2000 ms latency
- Hiker: high success rate, 1500–3500 ms latency
- DOM: final fallback, always succeeds, 3000–6000 ms latency
Mock mode is ideal for local development and testing the full orchestration loop without AWS.
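The mock responses can be sketched roughly as follows. This is illustrative only: the Hiker failure rate below is an assumption ("high success rate" in the list above), and the real logic lives in `apps/workers/invoker.py`:

```python
import random

# Failure rate and latency ranges per layer, from the list above.
# The hiker fail_rate of 5% is an assumed value.
MOCK_PROFILES = {
    "graphql": {"fail_rate": 0.40, "fail_type": "rate_limit", "latency_ms": (800, 2000)},
    "hiker":   {"fail_rate": 0.05, "fail_type": "timeout",    "latency_ms": (1500, 3500)},
    "dom":     {"fail_rate": 0.00, "fail_type": "none",       "latency_ms": (3000, 6000)},
}

def mock_worker_response(job_id: str, layer: str, comments_requested: int,
                         rng: random.Random = random) -> dict:
    """Simulate one worker attempt, shaped like the worker response contract."""
    profile = MOCK_PROFILES[layer]
    failed = rng.random() < profile["fail_rate"]
    return {
        "job_id": job_id,
        "layer": layer,
        "success": not failed,
        "error_type": profile["fail_type"] if failed else "none",
        "error_message": "simulated failure" if failed else "",
        "comments_fetched": 0 if failed else comments_requested,
        "total_comments_available": comments_requested,
        "time_taken_ms": rng.randint(*profile["latency_ms"]),
    }
```

Passing an explicit `random.Random(seed)` makes mock runs reproducible, which is handy when testing the orchestration loop.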
Every Lambda must return this exact structure:

```json
{
  "job_id": "string (UUID)",
  "layer": "graphql | hiker | dom",
  "success": true,
  "error_type": "rate_limit | blocked | timeout | parsing_error | none",
  "error_message": "string",
  "comments_fetched": 120,
  "total_comments_available": 500,
  "time_taken_ms": 3420
}
```

Rules:

- Workers must not perform retries or fallbacks
- Workers must not call other layers
- On failure, set `success: false` and populate `error_type`
- `time_taken_ms` is the wall-clock time of the scrape attempt only
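A minimal synchronous handler honoring this contract might look like the sketch below. The `scrape_comments` helper is hypothetical (a stand-in for your actual scraping client); the response fields mirror the contract above:

```python
import time

def scrape_comments(target_url: str, limit: int) -> list:
    """Hypothetical scraping helper -- stands in for the real GraphQL client."""
    raise NotImplementedError

def lambda_handler(event, context=None):
    """Handle one scrape attempt and return the worker response contract."""
    started = time.monotonic()
    try:
        comments = scrape_comments(event["target_url"], event["comments_requested"])
        success, error_type, error_message = True, "none", ""
        fetched = len(comments)
    except Exception as exc:
        # A real worker would classify errors into the contract's types
        # (rate_limit / blocked / timeout / parsing_error); "unknown" is the catch-all.
        success, error_type, error_message = False, "unknown", str(exc)
        fetched = 0
    return {
        "job_id": event["job_id"],
        "layer": "graphql",
        "success": success,
        "error_type": error_type,
        "error_message": error_message,
        "comments_fetched": fetched,
        "total_comments_available": fetched,
        "time_taken_ms": int((time.monotonic() - started) * 1000),
    }
```

Note the handler never retries or falls back itself; that is the orchestration engine's job.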
```
scrapeops/
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── manage.py
├── .env
│
├── scrapeops/                 # Django project
│   ├── settings.py
│   ├── urls.py
│   ├── celery.py
│   └── wsgi.py
│
├── apps/
│   ├── jobs/                  # Core job lifecycle + orchestration engine
│   │   ├── models.py          — ScrapingJob, JobEvent
│   │   ├── orchestration.py   — Decision engine: retry/fallback/complete
│   │   ├── serializers.py
│   │   ├── views.py
│   │   ├── admin.py
│   │   └── urls.py
│   │
│   ├── workers/               # Execution plane interface
│   │   ├── invoker.py         — Lambda caller + mock mode
│   │   ├── tasks.py           — Celery dispatch + async response handling
│   │   ├── views.py
│   │   └── urls.py
│   │
│   ├── metrics/               # Analytics and observability
│   │   ├── views.py           — Overview, layer breakdown, errors, throughput
│   │   └── urls.py
│   │
│   └── callbacks/             # Outbound job callbacks + async worker receiver
│       ├── tasks.py           — HTTP delivery with exponential backoff retry
│       ├── views.py
│       └── urls.py
│
└── config/
    └── mongo.py               # MongoDB connection singleton
```
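The callback delivery in `apps/callbacks/tasks.py` is described as HTTP delivery with exponential backoff retry. A typical backoff schedule for such a task looks like the sketch below; the base delay and cap are assumed values, not the project's actual settings:

```python
def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Delay in seconds before retry `attempt` (0-based): 2, 4, 8, ... capped at 5 min."""
    return min(base * (2 ** attempt), cap)

# With Celery, a bound task would typically re-enqueue itself on failure:
#   raise self.retry(countdown=backoff_delay(self.request.retries))
```

Capping the delay keeps a long-dead callback endpoint from pushing retries out indefinitely while still spacing attempts apart.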
- Deploy your Lambda with a single responsibility (GraphQL/Hiker/DOM only)
- Set the ARN in `.env`
- Lambda receives this payload:

  ```json
  {
    "job_id": "uuid",
    "layer": "graphql",
    "target_id": "instagram_post_id",
    "target_url": "https://...",
    "comments_requested": 150,
    "callback_url": "http://your-backend/api/v1/callbacks/worker/"
  }
  ```

- Lambda returns the worker response contract (see above)
- For async invocation, POST the result to `callback_url` instead of returning it
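For the async case, the worker can deliver its result with the stdlib `urllib`. A sketch, assuming the result dict already matches the worker response contract:

```python
import json
from urllib import request

def build_callback_request(callback_url: str, result: dict) -> request.Request:
    """Build the POST that delivers a worker result back to the backend."""
    return request.Request(
        callback_url,
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# In the Lambda, after scraping, instead of returning the result:
#   req = build_callback_request(event["callback_url"], result)
#   request.urlopen(req, timeout=10)
```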
Recommended stack:
- OpenTelemetry — traces and spans across job lifecycle (SDK included in requirements)
- Grafana — dashboards from PostgreSQL metrics
- AWS CloudWatch — Lambda execution logs and error rates
Key metrics to alert on:

- `success_rate_pct` < 80% over any 1h window
- `avg_job_time_ms` > 15000
- `jobs_with_fallbacks / total_jobs` > 50%
- Celery queue depth > 500
Available at http://localhost:8000/admin/
Create a superuser:
```bash
docker compose exec backend python manage.py createsuperuser
```

The admin provides full job and event visibility, with inline event history per job.