URL search: CDX server API
The Pwa-Technologies software supports the CDX Server API.
The CDX server API allows automated access to list, sort, and filter preserved pages for a given URL.
The only required parameter of the CDX server API is url. For example, http://arquivo.pt/wayback/cdx?url=publico.pt will return a list of captures for 'publico.pt'.
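As an illustration, a query URL like the one above can be assembled programmatically. This is a minimal sketch: the helper name build_cdx_query is hypothetical, and the endpoint is simply copied from the example in the text. It only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Endpoint copied from the example above; adjust if your deployment differs.
CDX_ENDPOINT = "http://arquivo.pt/wayback/cdx"


def build_cdx_query(url, **params):
    """Return a CDX query URL. `url` is the only required parameter.

    Extra keyword arguments are appended as query parameters;
    parameters whose value is None are skipped.
    """
    query = {"url": url}
    query.update({k: v for k, v in params.items() if v is not None})
    return CDX_ENDPOINT + "?" + urlencode(query)


print(build_cdx_query("publico.pt"))
```

Because `from` is a reserved word in Python, date parameters have to be passed with dictionary unpacking, for example build_cdx_query("sapo.pt", **{"from": "2014", "to": "2014"}).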
from / to
Setting from= or to= will restrict the results to the given date/time range (inclusive).
Timestamps may have up to 14 digits (yyyyMMddhhmmss). Shorter values are padded to the lower bound for from= and to the upper bound for to=.
Example: http://arquivo.pt/wayback/cdx?url=sapo.pt&from=2014&to=2014 will return results of sapo.pt that have a timestamp between 20140101000000 and 20141231235959
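The padding rule can be illustrated locally. The sketch below is not the server's implementation: the function name pad_timestamp is hypothetical, it accepts only whole fields (4, 6, 8, 10, 12, or 14 digits), and it assumes that an upper bound without a day is padded to the last day of that month.

```python
import calendar


def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp to 14 digits (yyyyMMddhhmmss).

    upper=False pads to the earliest matching instant (as for from=),
    upper=True pads to the latest matching instant (as for to=).
    """
    if not ts.isdigit() or len(ts) not in (4, 6, 8, 10, 12, 14):
        raise ValueError("expected 4, 6, 8, 10, 12 or 14 digits")
    year = ts[0:4]
    month = ts[4:6] or ("12" if upper else "01")
    if ts[6:8]:
        day = ts[6:8]
    elif upper:
        # Assumption: the upper bound is the last day of the month.
        day = "%02d" % calendar.monthrange(int(year), int(month))[1]
    else:
        day = "01"
    hour = ts[8:10] or ("23" if upper else "00")
    minute = ts[10:12] or ("59" if upper else "00")
    second = ts[12:14] or ("59" if upper else "00")
    return year + month + day + hour + minute + second


print(pad_timestamp("2014"), pad_timestamp("2014", upper=True))
```

With from=2014 and to=2014 this reproduces the range given in the example above.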
The CDX server supports the following matchType values:
exact -- default setting, will return captures that match the url exactly
prefix -- return captures that begin with a specified path, eg: http://sapo.pt/noticias/*
host -- return captures that belong to the specified host (the path segment is ignored if specified)
domain -- return captures for the current host and all subdomains, eg. *.example.com
Instead of specifying a separate matchType parameter, wildcards may be used in the url:
- ?url=http://www.sapo.pt/noticias/* is equivalent to ?url=http://www.sapo.pt/noticias/&matchType=prefix
- ?url=*.sapo.pt is equivalent to ?url=sapo.pt&matchType=domain
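The two wildcard equivalences above can be expressed as a small client-side helper. This is a sketch only; the name split_wildcard is hypothetical, and it covers just the two wildcard forms listed in the text.

```python
def split_wildcard(url):
    """Translate a wildcard url value into a (url, matchType) pair."""
    if url.startswith("*."):
        # Leading "*." selects the host and all of its subdomains.
        return url[2:], "domain"
    if url.endswith("*"):
        # Trailing "*" selects every capture under the given path.
        return url[:-1], "prefix"
    # No wildcard: the default exact match applies.
    return url, "exact"


print(split_wildcard("http://www.sapo.pt/noticias/*"))
print(split_wildcard("*.sapo.pt"))
```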
Setting limit= will limit the number of index lines returned. Limit must be set to a positive integer. If no limit is provided, all the matching lines are returned, which may be slow.
Example: http://arquivo.pt/wayback/cdx?url=http://www.sapo.pt/noticias/&matchType=prefix&limit=1500 will show the first 1500 results.
The sort param can be set as follows:
reverse: will sort the matching captures in reverse order. It is only recommended for exact queries, as reversing a large match may be very slow.
closest: setting this option also requires setting closest=<timestamp>, where <timestamp> is a specific timestamp to sort by. This option will only work correctly for exact queries and is useful for sorting captures based on time distance from a certain timestamp.
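To make the idea of sorting by time distance concrete, the sketch below orders a list of capture timestamps by their distance from a target timestamp. It runs locally and only illustrates the ordering; it is not the server's code. The function names are hypothetical, full 14-digit timestamps are assumed, and the sample values are made up.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit yyyyMMddhhmmss timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def sort_by_closest(timestamps, closest):
    """Order timestamps by absolute time distance from `closest`."""
    target = parse_ts(closest)
    return sorted(timestamps, key=lambda ts: abs(parse_ts(ts) - target))


# Made-up capture timestamps, for illustration only.
captures = ["20100101000000", "20140601120000", "20150101000000"]
print(sort_by_closest(captures, "20140701000000"))
```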
output (JSON output)
Setting output=json will return each line as a proper JSON dictionary. (Default format is text which will return the native format of the underlying CDX index, and may not be consistent). Using output=json is recommended for extensive analysis.
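A response in this format can be read line by line. The sketch below assumes one JSON dictionary per line, as described above. The function name parse_cdx_json_lines is hypothetical, and the sample records (including their field values and types) are made up for illustration rather than taken from a real response.

```python
import json


def parse_cdx_json_lines(text):
    """Parse a response body that holds one JSON dictionary per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Made-up sample body; real responses may carry more fields.
sample = "\n".join([
    '{"url": "http://publico.pt/", "timestamp": "20140101000000", "status": "200"}',
    '{"url": "http://publico.pt/", "timestamp": "20140102000000", "status": "301"}',
])

for record in parse_cdx_json_lines(sample):
    print(record["timestamp"], record["status"])
```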
The filter param can be specified multiple times to filter by specific fields in the CDX index. Field names correspond to the fields returned in the JSON output. Filters can be specified as follows:
Return captures from publico.pt/* where mime is text/html and HTTP status is not 200.
The ! modifier before =status indicates negation. The = and ~ modifiers are optional and specify exact and regular-expression matches, respectively. The default (no specific modifier) is to filter on whether the query string is contained in the field value. Negation and the exact/regex modifiers may be combined.
The formal syntax is:
filter=<fieldname>:[!][=|~]<expression> with the following modifiers:
|(no modifier)|field "mime" contains string "html"|
|=|exact match: field "mime" is "text/html"|
|~|regex match: expression matches beginning of field "mime" (cf. re.match)|
|!|field "mime" does not contain string "html"|
|!=|field "mime" is not "text/html"|
|!~|expression does not match beginning of field "mime"|
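The six cases listed above can be emulated locally to check what a given filter would keep. This sketch models only the matching semantics described in the text (substring, exact, and re.match-style regex, each optionally negated). The function name field_matches is hypothetical, and the code does not build or send a filter parameter.

```python
import re


def field_matches(value, expression, modifier=""):
    """Apply one filter to a field value.

    modifier is one of "", "=", "~", "!", "!=", "!~".
    """
    negate = modifier.startswith("!")
    kind = modifier[1:] if negate else modifier
    if kind == "":
        hit = expression in value          # substring containment
    elif kind == "=":
        hit = value == expression          # exact match
    elif kind == "~":
        # re.match anchors the expression at the beginning of the field.
        hit = re.match(expression, value) is not None
    else:
        raise ValueError("unknown modifier: %r" % modifier)
    return hit != negate


print(field_matches("text/html", "html"))        # contains
print(field_matches("text/html", "html", "~"))   # regex anchored at start
```

Note that the regex case behaves differently from the substring case: an expression such as html does not match text/html with ~ because matching starts at the beginning of the field.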
The fl param can be used to specify which fields to include in the output. The standard available fields are:
Fields can be comma delimited. For example, ?url=publico.pt&fl=url,timestamp,status will only include the url, timestamp, and status fields in the output.