Scrapeomat API docs
===================
GET http://<host>/<path>/api/slurp
Fetches full articles from the scrapeomat store.
PARAMETERS:
pubfrom
pubto
Only articles with publication dates within this range will be returned.
More specifically, the range is: pubfrom <= published < pubto
These can be days like "2006-03-23", or full RFC3339 dates, with the
timezone offset and all (eg: "2006-01-02T15:04:05+07:00").
For the day-only form, the day is taken as UTC. So, while London is on
BST (UTC+01:00), a day-only range is skewed by one hour relative to the
London day - you'll be missing the first hour of that day, but you'll
get the first hour of the following day instead.
Don't forget to url-escape the params (the plus sign in the timezone
caused me a little head-scratching ;-)
pub
Filter by publication.
By default, all publications are included in the results, but if one or
more "pub" params are included, the results will be narrowed down.
The values for "pub" are the publication codes "bbc", "dailymail",
"guardian" etc etc...
(I can get you a list if you need them, or you can just pick them out
of the results yourself :-)
xpub
Exclude publications. Any publications specified with xpub will
be filtered out.
since_id
Only return articles with an internal ID larger than this.
count
Limit the returned set of articles to this many at most.
There'll be some internal limit, which will probably end
up at about 2000 or so.
EXAMPLE:
To fetch all the articles published on May 3rd, London time (+01:00
currently):
http://foo.scumways.com/ukarts/api/slurp?pubfrom=2015-05-03T00%3A00%3A00%2B01%3A00&pubto=2015-05-04T00%3A00%3A00%2B01%3A00
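A minimal Python sketch of building the same URL with the standard library, so the escaping is handled for you. The helper name `slurp_url` is made up for illustration, the fixed +01:00 offset stands in for London summer time, and repeating "xpub" once per publication (like "pub") is an assumption:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Example host from above; substitute your own <host>/<path>.
BASE = "http://foo.scumways.com/ukarts/api/slurp"


def slurp_url(pubfrom, pubto, pubs=(), xpubs=(), since_id=None, count=None):
    """Build a slurp URL; urlencode() escapes the '+' and ':' in the dates."""
    params = [("pubfrom", pubfrom), ("pubto", pubto)]
    params += [("pub", p) for p in pubs]      # one "pub" param per publication
    params += [("xpub", p) for p in xpubs]    # assumed to repeat the same way
    if since_id is not None:
        params.append(("since_id", str(since_id)))
    if count is not None:
        params.append(("count", str(count)))
    return BASE + "?" + urlencode(params)


# May 3rd 2015, London time (+01:00 while BST is in effect).
bst = timezone(timedelta(hours=1))
start = datetime(2015, 5, 3, tzinfo=bst)
end = start + timedelta(days=1)
url = slurp_url(start.isoformat(), end.isoformat())
print(url)
```

Because pubto is exclusive, the end of the range is simply midnight at the start of the following day.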
RETURNS:
Upon error, a non-200 HTTP code will be returned (eg "400 Bad Request"
if the parameters are bad).
Upon success, the articles are returned as a stream of JSON
objects:
{"article": { ... article 1 data ... }}
{"article": { ... article 2 data ... }}
...
{"article": { ... article N data ... }}
If an error occurs after the data starts flowing, an error object will be
returned with some description, eg:
{"error": "too many fish"}
I plan to define some other objects in addition to "article" and "error"
(eg progress updates), so if you just ignore anything unknown you should
be fine.
The article data should be reasonably self-explanatory.
The "content" field is the article text, in somewhat-sanitised HTML.
The "urls" field contains a list of known URLs (including canonical URL,
if known).
If the results were clipped, the last object returned will be:
{"next": {"since_id": N}}
where N is the ID of the highest received article, which can be used
as a parameter in the next request.
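A rough Python sketch of reading such a response, assuming the whole body is already in a string. It does not rely on the objects being separated by newlines, only on them being concatenated JSON objects. The function names and the sample body are made up for illustration, not real server output:

```python
import json


def parse_stream(text):
    """Yield each JSON object from a body of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode() does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def read_results(text):
    """Return (articles, next_since_id); raise if the server sent an error."""
    articles, next_since_id = [], None
    for obj in parse_stream(text):
        if "article" in obj:
            articles.append(obj["article"])
        elif "error" in obj:
            raise RuntimeError(obj["error"])
        elif "next" in obj:
            next_since_id = obj["next"]["since_id"]
        # Anything else (eg future progress updates) is ignored.
    return articles, next_since_id


# Made-up sample body in the shape described above.
sample = '''
{"article": {"content": "<p>one</p>", "urls": ["http://example.com/1"]}}
{"progress": {"done": 1}}
{"article": {"content": "<p>two</p>", "urls": ["http://example.com/2"]}}
{"next": {"since_id": 2}}
'''
articles, since_id = read_results(sample)
print(len(articles), since_id)
```

If `since_id` comes back as something other than None, the results were clipped: repeat the request with the same parameters plus that since_id, and keep going until no "next" object is returned.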
FUTURE PLANS:
- Some sort of simple token-based auth.
- Other API endpoints for interrogating publication
codes, article counts and whatever other stats or diagnostic stuff would be
useful.
METHOD:
GET /api/pubs
PARAMETERS:
none
RETURNS:
JSON object with one member, "publications".
"publications" is a list of the publications in the DB, each with the fields:
code - short code (lowercase) for publication (eg "dailyblah")
name - human-readable name of publication (eg "The Daily Blah")
domain - main domain for publication (eg "www.dailyblah.com")
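A small Python sketch of picking the publication codes out of such a response, eg to use as "pub" or "xpub" values. The response body here is made up to match the shape described above; the names and domains are illustrative only:

```python
import json

# Made-up response body, not real server output.
body = '''{"publications": [
  {"code": "dailyblah", "name": "The Daily Blah", "domain": "www.dailyblah.com"},
  {"code": "otherpaper", "name": "The Other Paper", "domain": "www.otherpaper.example"}
]}'''

pubs = json.loads(body)["publications"]

# Map each short code to its human-readable name.
names_by_code = {p["code"]: p["name"] for p in pubs}
codes = sorted(names_by_code)
print(codes)
```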