cmd/slurpserver/api.txt

Scrapeomat API docs
===================

GET http://<host>/<path>/api/slurp

Fetches full articles from the scrapeomat store.


PARAMETERS:

pubfrom
pubto

  Only articles with publication dates within this range will be returned.
  More specifically, the range is:   pubfrom >= published < pubto

  These can be days like "2006-03-23", or full RFC3339 dates, with the
  timezone offset and all (eg: "2006-01-02T15:04:05+07:00"). 

  For the day-only form, the day is taken as UTC. So, because London is
  currently using BST, the articles returned will be skewed by one hour -
  you'll be missing an hour from one day, but it'll include an hour from
  another day instead.

  Don't forget to url-escape the params (the plus sign in the timezone
  caused me a little head-scratching ;-)

pub
  filter by publication.

  By default, all publications are included in the results, but if one or
  more "pub" params are included, the results will be narrowed down.
  The values for "pub" are the publication codes "bbc", "dailymail",
  "guardian" etc etc...
  (I can get you a list if you need them, or you can just pick them out
  of the results yourself :-)

xpub
  Exclude publications. Any publications specified with xpub will
  be filtered out.


since_id
  Only return articles with an internal ID larger than this.

count
  limit the returned set of articles to this many at most.
  There'll be some internal limit, which will probably end
  up at about 2000 or so.


EXAMPLE:
  to fetch all the articles published on May 3rd, London time (+01:00
  currently):

  http://foo.scumways.com/ukarts/api/slurp?pubfrom=2015-05-03T00%3A00%3A00%2B01%3A00&pubto=2015-05-04T00%3A00%3A00%2B01%3A00


RETURNS:

Upon error, a non-200 HTTP code will be returned (eg "400 Bad Request"
if the parameters are bad).

Upon success, the articles are returned as a stream of json
objects:

  {"article": { ... article 1 data ... }}
  {"article": { ... article 2 data ... }}
     ...
  {"article": { ... article N data ... }}

If an error occurs after the data starts flowing, an error object will be
returned with some description, eg:

  {"error": "too many fish"}

I plan to define some other objects in addtion to "article" and "error"
(eg progress updates), so if you just ignore anything unknown you should
be fine.

The article data should be reasonably self-explanatory.
The "content" field is the article text, in somewhat-sanitised HTML.
The "urls" field contains a list of known URLs (including canonical URL,
if known).

If the results were clipped, the last object returned will be:
  {"next": {"since_id": N}}
where N is the ID of the highest received article, which can be used
as a parameter in the next request.


FUTURE PLANS:

- Some sort of simple token-based auth.

- Other API endpoints for interogating publication
codes, article counts and whatever other stats or diagnostic stuff would be
useful.


METHOD:
GET /api/pubs

PARAMETERS:
none

RETURNS
json object with one member, "publications".
"publications" is a list of the publications in the DB, each with the fields:
code    - short code (lowercase) for publication (eg "dailyblah")
name    - human-readable name of publication (eg "The Daily Blah")
domain  - main domain for publication  (eg "www.dailyblah.com")