a simple & tiny scrapy clustering solution, considered a drop-in replacement for scrapyd

scrapyd-go

A drop-in replacement for scrapyd that is easier to scale and distribute across any number of commodity machines with no hassle. Each scrapyd-go instance is a stateless microservice. All instances must be connected to the same redis server: redis is used as a centralized registry for all instances, so each instance sees what the others see.

Why

scrapyd isn't bad, but it is very stateful, which makes it hard to deploy in a distributed environment like k8s. I also wanted to add more features, so I started this project as a drop-in replacement for scrapyd, written with modern, scalable tools: go for the RESTful server and redis as a centralized registry.

TODOs

  • schedule.json
  • cancel.json
  • addversion.json
  • listprojects.json
  • listversions.json
  • listspiders.json
  • delproject.json
  • delversion.json
  • listjobs.json
  • daemonstatus.json
  • logs/{jobid}, new: realtime output of the job log
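
Since the endpoints above mirror scrapyd's JSON API, an existing scrapyd client should work unchanged. The following is a minimal sketch of scheduling a job from go, assuming scrapyd-go accepts the same `project` and `spider` form fields and returns the same `status` and `jobid` fields that scrapyd documents for schedule.json. The names `scheduleForm`, `parseScheduleResponse` and `scheduleJob`, as well as `myproject` and `myspider`, are hypothetical and are not part of this project.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// scheduleResponse holds the fields scrapyd documents for schedule.json.
// Assumption: scrapyd-go replies with the same shape.
type scheduleResponse struct {
	Status string `json:"status"`
	JobID  string `json:"jobid"`
}

// scheduleForm builds the form body for a schedule.json request.
func scheduleForm(project, spider string) url.Values {
	form := url.Values{}
	form.Set("project", project)
	form.Set("spider", spider)
	return form
}

// parseScheduleResponse extracts the job id from a schedule.json reply
// and returns an error when the status is anything other than "ok".
func parseScheduleResponse(body []byte) (string, error) {
	var out scheduleResponse
	if err := json.Unmarshal(body, &out); err != nil {
		return "", err
	}
	if out.Status != "ok" {
		return "", fmt.Errorf("schedule.json returned status %q", out.Status)
	}
	return out.JobID, nil
}

// scheduleJob posts the form to <baseURL>/schedule.json and returns the job id.
func scheduleJob(baseURL, project, spider string) (string, error) {
	resp, err := http.PostForm(baseURL+"/schedule.json", scheduleForm(project, spider))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return parseScheduleResponse(body)
}

func main() {
	// Placeholder project and spider names; replace them with your own.
	jobID, err := scheduleJob("http://localhost:6800", "myproject", "myspider")
	if err != nil {
		fmt.Println("schedule failed:", err)
		return
	}
	fmt.Println("scheduled job:", jobID)
}
```

Because every instance shares the same redis registry, the request can be sent to any scrapyd-go instance.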

Configurations

scrapyd-go is configured entirely through command line flags:

  -dir string
        the directory to use for local caching (default ".scrapyd-go")
  -listen string
        the address to bind to (default ":6800")
  -max2keep int
        the maximum jobs/logs to keep in memory (default 1000000)
  -poll int
        time in millisecond between each poll operation from queue(s) (default 10)
  -python string
        the python binary to use (default "python3")
  -redis string
        the redis server address (default "redis://:somepass@localhost:6379/1")
  -sync int
        time in seconds between each sync operation (default 15)
  -workers int
        the maximum workers count (default cpu-cores-count)
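
For readers who want to see how a flag set like this is usually declared, here is a minimal sketch using go's standard `flag` package. It is not the project's actual source: the `config` struct and the `parseConfig` function are hypothetical, and only the flag names, defaults and descriptions are taken from the list above. The default for `-workers` is assumed to come from `runtime.NumCPU()`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
)

// config is a hypothetical container for the flags listed above.
type config struct {
	Dir      string
	Listen   string
	Max2Keep int
	Poll     int
	Python   string
	Redis    string
	Sync     int
	Workers  int
}

// parseConfig parses args (without the program name) into a config,
// using the defaults documented in the flag list.
func parseConfig(args []string) (*config, error) {
	fs := flag.NewFlagSet("scrapyd-go", flag.ContinueOnError)
	c := &config{}

	fs.StringVar(&c.Dir, "dir", ".scrapyd-go", "the directory to use for local caching")
	fs.StringVar(&c.Listen, "listen", ":6800", "the address to bind to")
	fs.IntVar(&c.Max2Keep, "max2keep", 1000000, "the maximum jobs/logs to keep in memory")
	fs.IntVar(&c.Poll, "poll", 10, "time in milliseconds between each poll operation from queue(s)")
	fs.StringVar(&c.Python, "python", "python3", "the python binary to use")
	fs.StringVar(&c.Redis, "redis", "redis://:somepass@localhost:6379/1", "the redis server address")
	fs.IntVar(&c.Sync, "sync", 15, "time in seconds between each sync operation")
	// Assumption: "cpu-cores-count" corresponds to runtime.NumCPU().
	fs.IntVar(&c.Workers, "workers", runtime.NumCPU(), "the maximum workers count")

	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	cfg, err := parseConfig(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", *cfg)
}
```

Any flag that is not passed keeps its default, so in most setups only `-redis` needs to be set explicitly.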

Installation

  • binary: go to the releases page and download the release for your OS
  • docker: $ docker pull alash3al/scrapyd-go
  • source: $ go get github.com/alash3al/scrapyd-go

Running

  • binary: $ ./scrapyd_bin_file -redis redis://localhost:6379/1
  • docker: $ docker run --link SomeRedisServerContainer -p 6800:6800 alash3al/scrapyd-go -redis redis://SomeRedisServerContainer:6379/1
  • source: $ scrapyd-go -redis redis://localhost:6379/1
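
Once an instance is running, a quick way to confirm that it is reachable is to query daemonstatus.json. The sketch below assumes that scrapyd-go serves this endpoint on the default `:6800` address and, like scrapyd, reports a `status` field equal to `ok` when healthy. The names `isHealthy` and `checkDaemon` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// isHealthy reports whether a daemonstatus.json body carries status "ok".
// Only the status field is decoded; any other fields are ignored.
func isHealthy(body []byte) bool {
	var st struct {
		Status string `json:"status"`
	}
	if err := json.Unmarshal(body, &st); err != nil {
		return false
	}
	return st.Status == "ok"
}

// checkDaemon fetches <baseURL>/daemonstatus.json with a short timeout.
func checkDaemon(baseURL string) (bool, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/daemonstatus.json")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return isHealthy(body), nil
}

func main() {
	ok, err := checkDaemon("http://localhost:6800")
	if err != nil {
		fmt.Println("instance not reachable:", err)
		return
	}
	fmt.Println("instance healthy:", ok)
}
```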

Contributing

  • Fork the repo
  • Create a feature branch
  • Push your changes
  • Create a pull request

License

Apache License v2.0

Author
