cacher is a node.js + python HTTP API for html to text and caching pipeline.
node cacher.js listens for POST /app and POST /get HTTP requests at 127.0.0.1:8009
python worker.py works through the redis queue of urls to get from the web.
cp cacher.conf /etc/supervisor.d/
Or something like that.
htmlcache:queue (LIST of urls) htmlcache:fetched (SET of urls)
Files go in:
The file names are just the url, but with the protocol deleted and most non-word characters converted to dashes.
- replace all
- truncate to 255 characters
That directory should be made writeable by whatever user
worker.py is running under,
and at least readable by the user
cacher.js runs as.