# Tracing SavePageNow

To better understand the internal workings of SavePageNow, and how they are imprinted in the collected WARC data, we ran a series of experiments on October 25, 2018 to archive a discrete set of URLs with three clients:

* the SavePageNow form at [https://web.archive.org/](https://web.archive.org)
* the [Wayback Machine Firefox Extension](https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new/)
* a Python bot [spn-probe](https://github.com/edsu/spn-probe/)

In each case we archived a distinct URL that identified both the client and the time, so that we could reliably identify the relevant WARC records later. For example:

* web form: https://mith.umd.edu/research/?ua=firefox&t=20181024230000
* browser extension: https://mith.umd.edu/research/?ua=extension&t=20181024230000
* spn-probe: https://web.archive.org/save/https://mith.umd.edu/research/?ua=spn-probe&t=20181024140003
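The tagging scheme above can be sketched in a few lines of Python. This is only an illustration, not the code the clients actually ran: `probe_url` and `parse_probe_url` are hypothetical helper names, and the timestamp format is inferred from the example URLs.

```python
import re
from datetime import datetime

BASE = "https://mith.umd.edu/research/"

def probe_url(client, when):
    """Build a URL whose query string records the client and the time."""
    return "{}?ua={}&t={}".format(BASE, client, when.strftime("%Y%m%d%H%M%S"))

def parse_probe_url(url):
    """Recover (client, time) from a tagged URL, or None if it is not one."""
    m = re.match(r"https://mith\.umd\.edu/research/\?ua=(.+)&t=(\d+)", url)
    return (m.group(1), m.group(2)) if m else None

url = probe_url("firefox", datetime(2018, 10, 24, 23, 0, 0))
print(url)
print(parse_probe_url(url))
```

Because the client and time travel in the URL itself, they show up in the `WARC-Target-URI` of any record written for that request.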

## Locating the Records

To locate the relevant records we could use the CDX files that the Internet Archive has made available in addition to the WARC data. However, we actually need to examine the request WARC records, and the CDX files do not index requests, so we will have to look at the WARC data itself again.
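As a rough sketch of what a CDX entry gives us, the snippet below splits one space-delimited line into named fields, following the `N b a m s k r M S V g` field order declared in the header of these files. The field names in `FIELDS` are descriptive labels chosen here, `parse_cdx_line` is a hypothetical helper, and the sample line is invented for illustration (its size, offset and filename in particular are made up).

```python
# Descriptive names for the CDX field letters N b a m s k r M S V g
FIELDS = ["urlkey", "timestamp", "original_url", "mime", "status", "digest",
          "redirect", "meta", "size", "offset", "filename"]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a dict keyed by field name."""
    entry = dict(zip(FIELDS, line.strip().split(" ")))
    entry["size"] = int(entry["size"])      # compressed record size in bytes
    entry["offset"] = int(entry["offset"])  # byte offset of the record in the WARC file
    return entry

# A made-up line for illustration only; size, offset and filename are invented.
sample = ("edu,umd,mith)/research?t=20181025100003&ua=spn-probe 20181025100004 "
          "https://mith.umd.edu/research/?ua=spn-probe&t=20181025100003 "
          "text/html 200 4PCAOET735ETDJHYI6UV2FPSNKRJEYKF - - "
          "1234 5678 example.warc.gz")

entry = parse_cdx_line(sample)
print(entry["original_url"], entry["offset"], entry["size"])
```

The filename, offset and size are what let us jump straight to a single record inside a large WARC file.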

In [110]:
from glob import glob
cdx_files = glob('warcs/liveweb-2018*/*os.cdx.gz')
len(cdx_files)

925

In [116]:
from warc_spark import init, extractor

sc, sqlc = init()

In [170]:
import re
import gzip
import zlib

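def lookup(cdx_files):
    """Yield the WARC headers of every CDX entry that matches a probe URL."""

    # dots are escaped so that they only match literal dots
    pattern = re.compile(r'https://mith\.umd\.edu/research/\?ua=(.+)&t=(\d+)')

    for cdx_file in cdx_files:
        with gzip.open(cdx_file, 'rt') as cdx:
            # skip the header line, which names the fields:
            #  CDX N b a m s k r M S V g
            next(cdx, None)

            for line in cdx:
                parts = line.strip().split(" ")
                url = parts[2]            # a: original URL
                size = int(parts[8])      # S: compressed record size
                offset = int(parts[9])    # V: compressed offset in the WARC file
                filename = parts[10]      # g: WARC file name

                m = pattern.match(url)
                if not m:
                    continue

                # use a separate handle so the CDX file handle is not shadowed
                with open('warcs/' + filename, 'rb') as warc:
                    warc.seek(offset)
                    bits = warc.read(size)

                # wbits of MAX_WBITS|16 makes zlib expect gzip framing
                data = zlib.decompress(bits, zlib.MAX_WBITS | 16)
                head = data.split(b"\r\n\r\n")[0].decode('utf8')
                yield {"client": m.group(1), "time": m.group(2), "record": head}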
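The seek, read and decompress step in `lookup` works because each record in these `.warc.gz` files is compressed as its own gzip member. The self-contained sketch below illustrates the same technique on an in-memory stand-in rather than a real WARC file; the two toy records and the variable names are assumptions made for the example.

```python
import gzip
import io
import zlib

# Build a small in-memory stand-in for a WARC file: two records, each
# compressed as its own gzip member and concatenated together.
records = [b"WARC/1.0\r\nWARC-Type: request\r\n\r\nfirst body",
           b"WARC/1.0\r\nWARC-Type: response\r\n\r\nsecond body"]
members = [gzip.compress(r) for r in records]
warc = io.BytesIO(b"".join(members))

# The offset and size that a CDX line would carry for the second record.
offset = len(members[0])
size = len(members[1])

warc.seek(offset)
bits = warc.read(size)

# MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
data = zlib.decompress(bits, zlib.MAX_WBITS | 16)
head = data.split(b"\r\n\r\n")[0].decode("utf8")
print(head)
```

Only the bytes of the one record are read and decompressed, so the cost of a lookup does not depend on the size of the WARC file.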
                

In [171]:
cdxs = sc.parallelize(cdx_files)
output = cdxs.mapPartitions(lookup)
results = output.collect()


In [175]:
for result in results:
    print("client: {}".format(result['client']))
    print("time: {}".format(result['time']))
    print('')
    print(result['record'])
    print('')
    print('------')
    print('')

client: spn-probe
time: 20181025100003

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:4b7592bb-5342-4daf-a35b-d829c926840d>
WARC-Date: 2018-10-25T10:00:04Z
Content-Length: 682352
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:4PCAOET735ETDJHYI6UV2FPSNKRJEYKF
WARC-Target-URI: https://mith.umd.edu/research/?ua=spn-probe&t=20181025100003
WARC-Warcinfo-ID: <urn:uuid:6cf1c288-dc35-43e0-901e-645087b70ef2>

------

client: spn-probe
time: 20181025010002

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:2c7f38ac-baed-4408-a207-8c8d5c74c1fe>
WARC-Date: 2018-10-25T01:00:04Z
Content-Length: 682356
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:HZIVWC2SJUJZXVORQTE5I76NMF43HOQY
WARC-Target-URI: https://mith.umd.edu/research/?ua=spn-probe&t=20181025010002
WARC-Warcinfo-ID: <urn:uuid:a4a9e6bc-1df0-44aa-a899-a424eebf9bef>

------

client: spn-probe
time: 20181025110002

WARC/1.0
WARC-Type: response
WARC-Record-ID: