# URLs in Wayback SPN Data

In addition to looking at popular host names it also could be useful to identify popular URLs that people (or bots) archived on each day. Were there attempts to archive multiple things on the same day, and what can we possibly infer about the significance of these multiple attempts?

The trouble is that when a browser interacts with SavePageNow via the [web form](https://web.archive.org) it receive the HTML for the requested webpage which has been rewritten to include some JavaScript. This JavaScript gets the browser to request any additional resources that are needed for rendering the page (JavaScript, images, CSS, etc) through SavePageNow as well. This means that a more high-fidelity recording is made, since all the resources for a web page are needed to make it human readable.

Some of these URLs may be for things like jQuery a Content Deliver Network, or a CSS file. These aren't terribly interesting in terms of this analysis which is attempting to find duplicates in the originally requested page. One thing we can do is limit our analysis to HTML pages, or requests that come back 200 OK with a `Content-Type` HTTP header containing text/html.

In [1]:
from warc_spark import init, extractor

sc, sqlc = init()

In [107]:
import re
from urllib.parse import urlparse

@extractor
def html_url(record):
    
    if record.rec_type == 'request':
        id = record.rec_headers.get_header('WARC-Concurrent-To')
        ua = record.http_headers.get('user-agent')
        if id and ua:
            yield (id, ua)
        
    elif record.rec_type == 'response' and 'html' in record.http_headers.get('content-type', ''):
        id = record.rec_headers.get_header('WARC-Record-ID')
        url = record.rec_headers.get_header('WARC-Target-URI')
        status_code = record.http_headers.get_statuscode()
        
        uri = urlparse(url)        
        is_dependency = re.match(r'.*\.(gif|jpg|jpeg|js|png|css)$', uri.path)
        if not is_dependency and status_code == '200' and id and url:
            yield {id, url}

In [141]:
from glob import glob

warc_files = glob('warcs/*/*.warc.gz')[1:2]
warcs = sc.parallelize(warc_files)
output = warcs.mapPartitions(html_url)
grouped = output.groupByKey()

print(warc_files)

['warcs/liveweb-20131025181033/live-20131025000757-00002-wwwb-app19.us.archive.org.warc.gz']


In [142]:
x = grouped.mapValues(list)
x = x.filter(lambda r: len(r[1]) == 2)

x = x.map(lambda r: r[1][1])
x = x.countByValue()
x = x.filter(lambda r: r[1] > 1)
x

#import json
#ua_families = json.load(open('../analysis/results/ua-families.json'))
#top_uas = json.load(open('../analysis/results/top-uas.json'))

#x = x.map(lambda d: (
#    d[0],
#    d[1][1],
#    d[1][0], 
#    ua_families.get(d[1][0], ''),
#))


AttributeError: 'collections.defaultdict' object has no attribute 'filter'

In [131]:
x[1].countByValue()

#df = x.toDF(['id', 'url', 'ua', 'ua-family'])
#df.url

TypeError: 'PipelinedRDD' object is not subscriptable

In [129]:
x

defaultdict(int,
            {('<urn:uuid:06961421-4581-408c-92cd-06caacdf5d9e>',
              'http://elmansour-host.com/record/_embed/http://elmansour-host.com?0f41fdd8c9af62a4027fda82d597062a=1623525913&a76908a5258476fc14b355b5afd3bd28=2344573&f67ff3454420ebb255fe4f673cf8cf0b=AwZ1ZGtlAwN3Zmx0AQNkZwx1-1',
              'Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)',
              'archive.org_bot'): 1,
             ('<urn:uuid:231f60e1-c9ca-46f8-8924-3a732145ce00>',
              'http://www.eventbrite.com/tickets-external?eid=3804348910&ref=etckt',
              'Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)',
              'archive.org_bot'): 1,
             ('<urn:uuid:98572061-58a2-40c2-a386-2babdde3991e>',
              'http://www.youtube.com/embed/Nc0TYPMipBs?rel=0',
              'Mozilla/5.0 (compatible; archive.org_bot; Wayback Machin