Search for historical mementos (archived copies) of a URL. Download metadata about the mementos and/or the memento content itself.
Instantiate a WaybackClient.
python
from wayback import WaybackClient

client = WaybackClient()
Search for all of Wayback's records for nasa.gov.
python
results = client.search('nasa.gov')
This statement should execute fairly quickly because it doesn't actually do much work. The object we get back, results, is a generator, a "lazy" object from which we can pull results, one at a time. As we pull items out of it, it loads them as needed from the Wayback Machine in chronological order. We can see that results by itself is not informative:
python
results
There are a couple of ways to pull items out of a generator like results. One simple way is to use the built-in Python function next, like so:
python
record = next(results)
This takes a moment to run because, now that we've asked to see the first item in the generator, this lazy object goes to fetch a chunk of results from the Wayback Machine. Looking at the record in detail,
python
record
we can find our answer: Wayback's first memento of nasa.gov was in 1996. We can use dot access on record to access the timestamp specifically.
python
record.timestamp
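The text above mentions a couple of ways to pull items out of a generator but only shows next. A second common way is a for loop, optionally capped with itertools.islice so that only the first few items are ever fetched. The sketch below is an illustration under stated assumptions: fake_search and FakeRecord are hypothetical stand-ins (not part of wayback) that yield made-up records, so the example runs without network access. With the real client you would iterate over client.search('nasa.gov') instead.

```python
from collections import namedtuple
from datetime import datetime, timezone
from itertools import islice

# Hypothetical stand-in for a search record; not the real wayback.CdxRecord.
FakeRecord = namedtuple("FakeRecord", ["url", "timestamp"])


def fake_search(url):
    """Lazily yield made-up records, oldest first (illustration only)."""
    for year in (1996, 1997, 1998, 1999):
        yield FakeRecord(url, datetime(year, 1, 1, tzinfo=timezone.utc))


results = fake_search("nasa.gov")

# islice stops after three items, so the fourth record is not produced here.
first_three = list(islice(results, 3))

for record in first_three:
    print(record.timestamp.year, record.url)

years = [record.timestamp.year for record in first_three]
```

Because the generator remembers its position, a later next(results) would continue with the fourth record rather than starting over.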
Above, we accessed the metadata for the oldest memento of nasa.gov, stored in the variable record. Starting from where we left off, we'll access the content of the memento and do a very simple analysis.
The Wayback Machine provides multiple playback modes to view the data it has captured. The wayback.Mode.view mode is a copy edited for human viewers on the web, and the wayback.Mode.original mode is the original copy of what was captured when the page was scraped. For analysis purposes, we generally want original. (Check the documentation of wayback.Mode for a few other, less commonly used modes.)
Let's download the original content using WaybackClient. (You could download the content directly with an HTTP library like requests, but WaybackClient adds extra tools for dealing with Wayback Machine servers.)
python
from wayback import Mode
# Mode.original is the default and doesn't need to be explicitly set;
# we've set it here to show how you might choose other modes.
response = client.get_memento(record, mode=Mode.original)
content = response.content.decode()
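Calling decode() with no arguments assumes the bytes are UTF-8, which may not hold for older archived pages. As a minimal sketch, assuming you only have the raw bytes and no reliable encoding information, you could try a short list of encodings in order. The helper decode_with_fallback and the sample bytes are hypothetical and not part of wayback.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Made-up sample bytes encoded as Latin-1, which are not valid UTF-8.
sample_bytes = "Café on Mars".encode("latin-1")
text = decode_with_fallback(sample_bytes)
print(text)
```

Note that latin-1 accepts every byte sequence, so placing it last makes it a catch-all; it may silently produce wrong characters if the page used some other encoding.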
We can use the built-in method count on strings to count the number of times that 'mars' appears in the content.
python
content.count('mars')
This is case-sensitive, so to be more accurate we should convert the content to lowercase first.
python
content.lower().count('mars')
We picked up a couple of additional occurrences that the original count missed.
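One caveat with str.count is that it counts substrings, so 'mars' is also found inside longer words. The sketch below compares the two counts from above with a whole-word count using the standard re module. The sample string is made up for illustration and is not the content of any memento.

```python
import re

# Made-up sample text; with a real memento you would use `content` instead.
sample = "Mars Pathfinder lands on MARS; Marshall Space Flight Center; mars rover"

# Case-sensitive substring count: only the lowercase occurrence matches.
case_sensitive = sample.count("mars")

# Case-insensitive substring count: also matches the "Mars" inside "Marshall".
case_insensitive = sample.lower().count("mars")

# Whole-word, case-insensitive count: \b requires a word boundary on each side.
whole_word = len(re.findall(r"\bmars\b", sample, flags=re.IGNORECASE))

print(case_sensitive, case_insensitive, whole_word)
```

Which count is "right" depends on the question being asked; the whole-word version is simply stricter about what qualifies as a mention.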
The Wayback Machine exposes its data through two different mechanisms, each implementing a different standard for archival data: the CDX API and the Memento API. We implement a Python client that can speak both.
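To make the distinction concrete, here is a rough sketch of what the two kinds of request URLs look like. Everything in it is an assumption for illustration: the endpoint and URL patterns reflect the Wayback Machine's public URL scheme as it is commonly documented, the helper names are hypothetical, and the timestamp is arbitrary. In practice, WaybackClient builds these requests for you.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed public endpoints; not taken from the wayback package itself.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
PLAYBACK_BASE = "https://web.archive.org/web"


def cdx_search_url(url):
    """Build a CDX query URL that lists captures of `url` (hypothetical helper)."""
    return f"{CDX_ENDPOINT}?{urlencode({'url': url})}"


def memento_url(url, when, mode_flag="id_"):
    """Build a playback URL for one capture (hypothetical helper).

    Assumes a 14-digit YYYYMMDDhhmmss timestamp followed by a mode flag,
    where "id_" requests the original bytes and "" the browser-friendly view.
    """
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{PLAYBACK_BASE}/{stamp}{mode_flag}/{url}"


# Arbitrary timestamp chosen for illustration, not a real capture time.
example_time = datetime(2000, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

search_url = cdx_search_url("nasa.gov")
original_url = memento_url("http://nasa.gov/", example_time)
view_url = memento_url("http://nasa.gov/", example_time, mode_flag="")
print(search_url)
print(original_url)
print(view_url)
```

The CDX URL answers "which captures exist?", while the playback URL retrieves one specific capture; this roughly mirrors the split between search and get_memento.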
wayback.WaybackClient
search
get_memento
wayback.CdxRecord
wayback.Memento
close
parse_memento_headers
wayback.WaybackSession
reset
wayback.memento_url_data
wayback.Mode
wayback.exceptions.WaybackException
wayback.exceptions.UnexpectedResponseFormat
wayback.exceptions.BlockedByRobotsError
wayback.exceptions.BlockedSiteError
wayback.exceptions.MementoPlaybackError
wayback.exceptions.RateLimitError
wayback.exceptions.WaybackRetryError
wayback.exceptions.SessionClosedError