Usage

Search for historical mementos (archived copies) of a URL. Download metadata about the mementos and/or the memento content itself.

Tutorial

What is the earliest memento of nasa.gov?

Instantiate a WaybackClient.

python

from wayback import WaybackClient
client = WaybackClient()

Search for all Wayback's records for nasa.gov.

python

results = client.search('nasa.gov')

This statement should execute fairly quickly because it doesn't actually do much work. The object we get back, results, is a generator, a "lazy" object from which we can pull results, one at a time. As we pull items out of it, it loads them as needed from the Wayback Machine in chronological order. We can see that results by itself is not informative:

python

results

There are a couple of ways to pull items out of a generator like results. One simple way is to use the built-in Python function next, like so:

python

record = next(results)

This takes a moment to run because, now that we've asked to see the first item in the generator, this lazy object goes to fetch a chunk of results from the Wayback Machine. Looking at the record in detail,

python

record

we can find our answer: Wayback's earliest memento of nasa.gov was captured in 1996. We can use dot access on record to get the timestamp specifically.

python

record.timestamp
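Besides next, a generator can be consumed with a for loop or sliced lazily with itertools.islice. The sketch below uses a stand-in generator, fake_search (hypothetical, not part of wayback), so it runs without network access. The object returned by client.search could presumably be consumed in the same ways.

```python
from itertools import islice


def fake_search():
    """Hypothetical stand-in for client.search(): yields items lazily."""
    for year in (1996, 1997, 1998, 1999):
        yield f"record-{year}"


results = fake_search()

# Pull exactly one item, as in the tutorial above.
first = next(results)

# Pull the next two items without exhausting the generator.
next_two = list(islice(results, 2))

# A for loop consumes whatever is left.
remaining = []
for item in results:
    remaining.append(item)
```

Note that a generator can only be walked once: after the loop finishes, results is exhausted and a new search is needed to start over.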

How many times does the word 'mars' appear on nasa.gov?

Above, we accessed the metadata for the oldest memento of nasa.gov, stored in the variable record. Starting from where we left off, we'll access the content of that memento and do a very simple analysis.

The Wayback Machine provides multiple playback modes to view the data it has captured. The wayback.Mode.view mode is a copy edited for human viewers on the web, and the wayback.Mode.original mode is the original copy of what was captured when the page was scraped. For analysis purposes, we generally want original. (Check the documentation of wayback.Mode for a few other, less commonly used modes.)
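To make the difference between modes concrete, here is a rough sketch of how a playback mode shows up in a Wayback Machine URL. The helper playback_url and the MODE_SUFFIX table are hypothetical and written only for illustration. The suffix values reflect our understanding of the public URL scheme, so treat them as assumptions and consult wayback.Mode for the authoritative values. The timestamp is an arbitrary 1996 date.

```python
from datetime import datetime, timezone

# Assumed URL suffixes for two playback modes (illustrative only;
# wayback.Mode is the authoritative source for these values).
MODE_SUFFIX = {"original": "id_", "view": ""}


def playback_url(url, timestamp, mode="original"):
    """Hypothetical helper: build a Wayback Machine playback URL."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss.
    stamp = timestamp.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}{MODE_SUFFIX[mode]}/{url}"


when = datetime(1996, 12, 31, 23, 58, 47, tzinfo=timezone.utc)
original = playback_url("http://www.nasa.gov/", when)
view = playback_url("http://www.nasa.gov/", when, mode="view")
```

In practice there is no need to build these URLs by hand, since WaybackClient accepts the mode as an argument.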

Let's download the original content using WaybackClient. (You could download the content directly with an HTTP library like requests, but WaybackClient adds extra tools for dealing with Wayback Machine servers.)

python

from wayback import Mode

# Mode.original is the default and doesn't need to be explicitly set;
# we've set it here to show how you might choose other modes.
response = client.get_memento(record, mode=Mode.original)
content = response.content.decode()
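Calling decode() with no arguments assumes the bytes are UTF-8, which may not hold for older pages. The following is a minimal sketch of a more forgiving approach. The helper decode_bytes is hypothetical (not part of wayback), and the sample bytes stand in for response.content.

```python
def decode_bytes(raw, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: return the first successful decoding of raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed (latin-1 never fails,
    # so this matters when the caller passes a different list).
    return raw.decode("utf-8", errors="replace")


# Stand-in for response.content: Latin-1 bytes that are not valid UTF-8.
sample = "Mars Pathfinder \u00e9".encode("latin-1")
text = decode_bytes(sample)
```

The order of the encodings matters: latin-1 accepts any byte sequence, so it belongs last as a catch-all rather than first.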

We can use the built-in method count on strings to count the number of times that 'mars' appears in the content.

python

content.count('mars')

This is case-sensitive, so to be more accurate we should convert the content to lowercase first.

python

content.lower().count('mars')

This picks up a couple of additional occurrences that the case-sensitive count missed.
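A substring count also matches longer words that merely contain 'mars', such as 'Marshall'. If only the standalone word is of interest, a regular expression with word boundaries is one option. This sketch uses a made-up content string in place of the downloaded page.

```python
import re

# Stand-in for the downloaded page content.
content = "Mars news: the Mars rover. Marshall Space Flight Center studies mars."

# Case-insensitive substring count, as in the tutorial; includes "Marshall".
substring_count = content.lower().count("mars")

# Case-insensitive whole-word count; \b marks a word boundary.
whole_word_count = len(re.findall(r"\bmars\b", content, flags=re.IGNORECASE))
```

Here substring_count is 4 and whole_word_count is 3, because 'Marshall' is excluded from the second count.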

API Documentation

The Wayback Machine exposes its data through two different mechanisms, implementing two different standards for archival data: the CDX API and the Memento API. This package implements a Python client that can speak both.

wayback.WaybackClient

search

get_memento

wayback.CdxRecord

wayback.Memento

close

parse_memento_headers

wayback.WaybackSession

reset

Utilities

wayback.memento_url_data

wayback.Mode

Exception Classes

wayback.exceptions.WaybackException

wayback.exceptions.UnexpectedResponseFormat

wayback.exceptions.BlockedByRobotsError

wayback.exceptions.BlockedSiteError

wayback.exceptions.MementoPlaybackError

wayback.exceptions.RateLimitError

wayback.exceptions.WaybackRetryError

wayback.exceptions.SessionClosedError