# Tracing SavePageNow

To better understand the internal workings of SavePageNow and how they are imprinted in the collected WARC data, we ran a series of experiments on October 25, 2018 to archive a discrete set of URLs with three clients:

* the SavePageNow form at [https://web.archive.org/](https://web.archive.org)
* the [Wayback Machine Firefox Extension](https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new/)
* a Python bot [spn-probe](https://github.com/edsu/spn-probe/)

In each case we archived a distinct URL that identified the client and the time, so that we could reliably identify the relevant WARC records later. For example:

* web form: https://mith.umd.edu/research/?ua=firefox&t=20181024230000
* browser extension: https://mith.umd.edu/research/?ua=extension&t=20181024230000
* spn-probe: https://web.archive.org/save/https://mith.umd.edu/research/?ua=spn-probe&t=20181024140003
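
A tagged URL of this shape could be generated along the following lines. This is only a sketch: `probe_url` is a hypothetical helper (it is not taken from spn-probe), and it assumes the `t` parameter is a `YYYYMMDDHHMMSS` timestamp, as the examples above suggest.

```python
from datetime import datetime
from urllib.parse import urlencode

# Page used in the experiments, and the SavePageNow prefix seen in the spn-probe example.
BASE_URL = "https://mith.umd.edu/research/"
SAVE_PREFIX = "https://web.archive.org/save/"

def probe_url(client, when, base_url=BASE_URL):
    """Build a URL whose query string records the client name and the time.

    Hypothetical helper for illustration; assumes a YYYYMMDDHHMMSS timestamp.
    """
    query = urlencode({"ua": client, "t": when.strftime("%Y%m%d%H%M%S")})
    return f"{base_url}?{query}"

url = probe_url("spn-probe", datetime(2018, 10, 24, 14, 0, 3))
print(url)                # the URL that should appear in the WARC records
print(SAVE_PREFIX + url)  # the URL a bot would request to trigger the capture
```

Because the client name and timestamp are part of the archived URL itself, the matching records can later be found with a single regular expression.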

## Locating the Records

To locate the relevant records we could use the CDX files that the Internet Archive has made available alongside the WARC data. However, we need to examine the request WARC records, and the CDX files do not index requests. So we will have to look at the WARC data again.

But rather than looking through all the WARC files for the desired requests, we can at least use the CDX files to locate the WARC files that contain them, and then open only those.

So first let's get the CDX files.

In [231]:
from glob import glob
cdx_files = glob('warcs/liveweb-2018*/*os.cdx.gz')
len(cdx_files)

925

In [201]:
from warc_spark import init, extractor

sc, sqlc = init()

In [250]:
import re
import gzip

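def find_warcs(cdx_files):
    """Yield the path of each WARC file that contains one of our test URLs."""
    for cdx_file in cdx_files:
        # the context manager closes the file even if the generator is abandoned early
        with gzip.open(cdx_file, 'rt') as fh:
            next(fh, None)  # skip the CDX header line
            for line in fh:
                # parse CDX https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/
                parts = line.strip().split(" ")
                url = parts[2]        # original URL
                filename = parts[10]  # WARC file that holds the record
                m = re.match(r'https://mith\.umd\.edu/research/\?ua=(.+)&t=(\d+)', url)
                if m:
                    yield 'warcs/' + filename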

In [245]:
cdxs = sc.parallelize(cdx_files)
warcs = cdxs.mapPartitions(find_warcs)
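
The column positions used in `find_warcs` (index 2 for the URL, index 10 for the WARC filename) correspond to the 11-field ` CDX N b a m s k r M S V g` layout. The sketch below names those columns to make the indexing easier to follow. It assumes the CDX files here use that layout, and the sample line is made up for illustration (the digest, sizes, and filename are placeholders, not real data).

```python
from collections import namedtuple

# Column names for the 11-field " CDX N b a m s k r M S V g" layout (assumed here).
CDX_FIELDS = [
    "urlkey",      # N: canonicalized URL
    "timestamp",   # b: capture time
    "original",    # a: original URL
    "mimetype",    # m
    "statuscode",  # s
    "digest",      # k
    "redirect",    # r
    "metatags",    # M
    "length",      # S: compressed record size
    "offset",      # V: offset of the record in the WARC file
    "filename",    # g: WARC file name
]
CdxRecord = namedtuple("CdxRecord", CDX_FIELDS)

def parse_cdx_line(line):
    """Split one space-delimited CDX line into named fields."""
    parts = line.strip().split(" ")
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    return CdxRecord(*parts)

# Hypothetical CDX line; every value except the URL pattern is a placeholder.
sample = (
    "edu,umd,mith)/research?t=20181025100003&ua=spn-probe 20181025140004 "
    "https://mith.umd.edu/research/?ua=spn-probe&t=20181025100003 "
    "text/html 200 EXAMPLEDIGEST - - 1234 5678 liveweb-example/example.warc.gz"
)
rec = parse_cdx_line(sample)
print(rec.original)
print(rec.filename)
```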

In [248]:
@extractor
def find_requests(record):
    # only request records carry the HTTP headers that SavePageNow sent to the site
    if record.rec_type == 'request':
        url = record.rec_headers.get_header('WARC-Target-URI')
        m = re.match(r'https://mith\.umd\.edu/research/\?ua=(.+)&t=(\d+)', url)
        if m:
            yield record.http_headers.to_str()

output = warcs.mapPartitions(find_requests)
requests = output.collect()

In [249]:
for req in requests:
    print(req)
    print('')
    print('------')
    print('')

GET /research/?ua=spn-probe&t=20181025100003 HTTP/1.1
Accept: */*
User-Agent: spn-probe
Via: HTTP/1.0 web.archive.org (Wayback Save Page)
Connection: close
Host: mith.umd.edu
Accept-Encoding: gzip,deflate


------

GET /research/?ua=spn-probe&t=20181025010002 HTTP/1.1
Accept: */*
User-Agent: spn-probe
Via: HTTP/1.0 web.archive.org (Wayback Save Page)
Connection: close
Host: mith.umd.edu
Accept-Encoding: gzip,deflate


------

GET /research/?ua=spn-probe&t=20181025110002 HTTP/1.1
Accept: */*
User-Agent: spn-probe
Via: HTTP/1.0 web.archive.org (Wayback Save Page)
Connection: close
Host: mith.umd.edu
Accept-Encoding: gzip,deflate


------

GET /research/?ua=spn-probe&t=20181025180003 HTTP/1.1
Accept: */*
User-Agent: spn-probe
Via: HTTP/1.0 web.archive.org (Wayback Save Page)
Connection: close
Host: mith.umd.edu
Accept-Encoding: gzip,deflate


------

GET /research/?ua=spn-probe&t=20181025020002 HTTP/1.1
Accept: */*
User-Agent: spn-probe
Via: HTTP/1.0 web.arc

We can see that SPN (in 2018) added the `Via: HTTP/1.0 web.archive.org (Wayback Save Page)` HTTP request header regardless of which client was used. The client's `User-Agent` is also passed through unchanged.

But notice one significant difference when the SavePageNow web form is used: the `Referer` header is set.

    GET /research/?ua=firefox&t=20181025230000 HTTP/1.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-US,en;q=0.5
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:61.0) Gecko/20100101 Firefox/61.0
    Via: HTTP/1.0 web.archive.org (Wayback Save Page)
    Referer: https://web.archive.org/
    Connection: close
    Host: mith.umd.edu
    Accept-Encoding: gzip,deflate
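
These observations can be turned into a rough classifier for request records. The sketch below is an illustration only: `parse_request_headers` and `describe_request` are hypothetical helpers, and treating a `Referer` of `https://web.archive.org/` as a sign of the web form is a heuristic based solely on the captures shown here.

```python
def parse_request_headers(raw):
    """Split a raw HTTP request into its request line and a dict of headers."""
    lines = raw.strip().splitlines()
    request_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers

def describe_request(raw):
    """Summarize the SavePageNow traces visible in one request (heuristic)."""
    _, headers = parse_request_headers(raw)
    return {
        # the Via header observed on every request in this experiment
        "via_spn": "Wayback Save Page" in headers.get("via", ""),
        # passed through from the client
        "user_agent": headers.get("user-agent"),
        # only seen when the web form was used (assumption based on this data)
        "from_web_form": headers.get("referer", "").startswith("https://web.archive.org"),
    }

# Abbreviated copy of the web form request shown above.
form_request = "\r\n".join([
    "GET /research/?ua=firefox&t=20181025230000 HTTP/1.1",
    "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:61.0) Gecko/20100101 Firefox/61.0",
    "Via: HTTP/1.0 web.archive.org (Wayback Save Page)",
    "Referer: https://web.archive.org/",
    "Host: mith.umd.edu",
])
print(describe_request(form_request))
```

Applied to the strings collected in `requests`, a function like this would separate web form captures from the others without reading each header block by eye.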
    
