Scrapeomat API docs
GET http://<host>/<path>/api/slurp
Fetches full articles from the scrapeomat store.
Only articles with publication dates within the pubfrom/pubto range will
be returned.
More specifically, the range is: pubfrom <= published < pubto
These can be days like "2006-03-23", or full RFC3339 dates, with the
timezone offset and all (eg: "2006-01-02T15:04:05+07:00").
For the day-only form, the day is taken as UTC. So, while London is on
BST (UTC+1), a day-only range is skewed by one hour relative to London
time: you'll be missing an hour from one day, but it'll include an hour
from another day instead.
Don't forget to URL-escape the params. An unescaped plus sign in the
timezone offset gets decoded as a space (which caused me a little
head-scratching ;-)
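As a rough sketch of the two points above (local-day ranges and escaping), the following builds pubfrom/pubto for one calendar day at a fixed UTC offset and URL-escapes them. The helper name `day_range_params` is made up for illustration, and the year 2006 is an arbitrary choice.

```python
from datetime import date, datetime, time, timedelta, timezone
from urllib.parse import urlencode


def day_range_params(day, utc_offset_hours):
    """Return pubfrom/pubto covering one local calendar day.

    Hypothetical helper. The range is half-open, so pubto is
    midnight at the start of the following day.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime.combine(day, time.min, tzinfo=tz)
    end = start + timedelta(days=1)
    # isoformat() on an aware datetime yields an RFC3339-style string
    # including the offset, e.g. "2006-05-03T00:00:00+01:00".
    return {"pubfrom": start.isoformat(), "pubto": end.isoformat()}


# May 3rd, London time during BST (+01:00); the year is arbitrary.
params = day_range_params(date(2006, 5, 3), 1)

# urlencode escapes "+" as %2B and ":" as %3A.
query = urlencode(params)
print(query)
```

The resulting string is meant to be appended after the "?" of the slurp URL.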
Filter by publication.
By default, all publications are included in the results, but if one or
more "pub" params are included, the results will be narrowed down.
The values for "pub" are the publication codes "bbc", "dailymail",
"guardian" etc etc...
(I can get you a list if you need them, or you can just pick them out
of the results yourself :-)
Exclude publications. Any publications specified with xpub will
be filtered out.
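A minimal sketch of building a query with several "pub" or "xpub" params, on the assumption that multiple values are passed by repeating the parameter name. The function name `slurp_query` is hypothetical.

```python
from urllib.parse import urlencode


def slurp_query(pubs=(), xpubs=(), **other):
    """Build a query string with repeated pub/xpub params.

    Hypothetical helper; assumes the server accepts repeated keys
    (pub=bbc&pub=guardian) for multiple publications.
    """
    pairs = [("pub", code) for code in pubs]
    pairs += [("xpub", code) for code in xpubs]
    # Any remaining params (e.g. pubfrom, pubto) in a stable order.
    pairs += sorted(other.items())
    return urlencode(pairs)


print(slurp_query(pubs=["bbc", "guardian"]))
print(slurp_query(xpubs=["dailymail"], pubfrom="2006-03-23"))
```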
Only return articles with an internal ID larger than the given since_id.
Limit the returned set of articles to this many at most.
There'll be some internal limit, which will probably end
up at about 2000 or so.
to fetch all the articles published on May 3rd, London time (+01:00
Upon error, a non-200 HTTP code will be returned (eg "400 Bad Request"
if the parameters are bad).
Upon success, the articles are returned as a stream of JSON objects:
{"article": { ... article 1 data ... }}
{"article": { ... article 2 data ... }}
{"article": { ... article N data ... }}
If an error occurs after the data starts flowing, an error object will be
returned with some description, eg:
{"error": "too many fish"}
I plan to define some other objects in addition to "article" and "error"
(eg progress updates), so if you just ignore anything unknown you should
be fine.
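One way a client might read such a stream is sketched below. It assumes the objects are simply concatenated, optionally separated by whitespace or newlines, and it skips any object type it does not recognise. The function names are illustrative only, and the "progress" object in the sample is invented to stand in for an unknown type.

```python
import json


def iter_objects(text):
    """Yield each top-level JSON value from concatenated JSON text."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def collect_articles(text):
    """Return the article payloads; raise if an error object appears."""
    articles = []
    for obj in iter_objects(text):
        if "article" in obj:
            articles.append(obj["article"])
        elif "error" in obj:
            raise RuntimeError(obj["error"])
        # Unknown object types are deliberately ignored.
    return articles


sample = (
    '{"article": {"content": "<p>one</p>"}}\n'
    '{"progress": 50}\n'
    '{"article": {"content": "<p>two</p>"}}\n'
)
print(len(collect_articles(sample)))
```

For large responses a real client would probably decode incrementally rather than holding the whole body in memory; this version keeps things short.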
The article data should be reasonably self-explanatory.
The "content" field is the article text, in somewhat-sanitised HTML.
The "urls" field contains a list of known URLs (including canonical URL,
if known).
If the results were clipped, the last object returned will be:
{"next": {"since_id": N}}
where N is the ID of the highest received article, which can be used
as a parameter in the next request.
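The paging described above could be driven by a loop along these lines. This is only a sketch: `fetch_page` is a hypothetical callable that performs one slurp request with the given since_id (None for the first request) and returns the decoded objects from the response.

```python
def fetch_all(fetch_page, since_id=None):
    """Collect articles across clipped result sets.

    fetch_page(since_id) is a hypothetical callable returning the list
    of decoded JSON objects from one request.
    """
    articles = []
    while True:
        next_id = None
        for obj in fetch_page(since_id):
            if "article" in obj:
                articles.append(obj["article"])
            elif "error" in obj:
                raise RuntimeError(obj["error"])
            elif "next" in obj:
                # Results were clipped; remember where to resume.
                next_id = obj["next"]["since_id"]
        if next_id is None:
            return articles
        since_id = next_id


# Canned responses standing in for real requests.
fake_pages = {
    None: [{"article": {"content": "a"}}, {"next": {"since_id": 10}}],
    10: [{"article": {"content": "b"}}],
}
print(fetch_all(lambda since: fake_pages[since]))
```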
- Some sort of simple token-based auth.
- Other API endpoints for interrogating publication
codes, article counts and whatever other stats or diagnostic stuff would be
GET /api/pubs
Returns a JSON object with one member, "publications".
"publications" is a list of the publications in the DB, each with the fields:
code - short code (lowercase) for publication (eg "dailyblah")
name - human-readable name of publication (eg "The Daily Blah")
domain - main domain for publication (eg "")
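For illustration, a response of that shape could be turned into a code-to-name lookup as sketched here. The sample body is made up, and its domain value is a placeholder rather than a real publication domain.

```python
import json


def publication_names(body):
    """Map publication code -> human-readable name."""
    doc = json.loads(body)
    return {pub["code"]: pub["name"] for pub in doc["publications"]}


# Invented sample response; "example.com" is a placeholder domain.
sample_body = json.dumps({
    "publications": [
        {"code": "dailyblah", "name": "The Daily Blah", "domain": "example.com"},
    ]
})
print(publication_names(sample_body))
```

The codes returned this way are the values to pass as "pub" or "xpub" in a slurp request.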