# URLs in Wayback SPN Data

In addition to looking at popular host names, it could also be useful to identify popular URLs that people (or bots) archived on each day. Were there multiple attempts to archive the same URL on the same day, and what can we infer about the significance of these repeated attempts?

The trouble is that when a browser interacts with SavePageNow via the [web form](https://web.archive.org), it receives the HTML for the requested webpage, which has been rewritten to include some JavaScript. This JavaScript gets the browser to request any additional resources that are needed for rendering the page (JavaScript, images, CSS, etc.) through SavePageNow as well. The result is a higher-fidelity recording, since all the resources for a web page are needed to make it human readable.

Some of these URLs may be for things like jQuery from a Content Delivery Network, or a CSS file. These aren't terribly interesting for this analysis, which is attempting to find duplicates among the originally requested pages. One thing we can do is limit our analysis to HTML pages, that is, requests that come back 200 OK with a `Content-Type` HTTP header containing text/html.
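The filter described above can be sketched in plain Python. This is only an illustration, assuming the status code arrives as a string and the header value may carry a charset parameter; the helper name `is_html_ok` is hypothetical and not part of this notebook.

```python
def is_html_ok(status_code, content_type):
    """Return True for a 200 OK response whose Content-Type contains text/html."""
    if content_type is None:
        return False
    # Header values can carry parameters, e.g. "text/html; charset=UTF-8"
    return status_code == "200" and "text/html" in content_type.lower()


print(is_html_ok("200", "text/html; charset=UTF-8"))
print(is_html_ok("200", "image/png"))
print(is_html_ok("404", "text/html"))
```

Only the first call passes both checks; the other two fail on content type and status code respectively.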

In [1]:
from warc_spark import init, extractor

sc, sqlc = init()

To find the URLs it's important that we also retain the User-Agent that executed the request, since this tells us something about who (or what) initiated SavePageNow. Unfortunately the User-Agent is in the WARC Request record, and the Content-Type of the response is in the WARC Response record. Luckily these can be connected using the WARC-Record-ID and WARC-Concurrent-To WARC headers.

The `get_urls` function takes a WARC record and, depending on whether it is a request or a response, yields a tuple containing the record ID and a small dictionary holding either the User-Agent or the URL of a text/html response. For example:

```
('<urn:uuid:551471a6-631b-4ef7-99a5-f1344348ab64>', {'ua': 'Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)'})
('<urn:uuid:551471a6-631b-4ef7-99a5-f1344348ab64>', {'url': 'https://yahoo.com'})
```


In [39]:
import re
from urllib.parse import urlparse

@extractor
def get_urls(record):
    
    if record.rec_type == 'request':
        id = record.rec_headers.get_header('WARC-Concurrent-To')
        ua = record.http_headers.get('user-agent')
        if id and ua:
            yield (id, {"ua": ua})
            
    elif record.rec_type in ['response', 'revisit'] and 'html' in record.http_headers.get('content-type', ''):
        id = record.rec_headers.get_header('WARC-Record-ID')
        url = record.rec_headers.get_header('WARC-Target-URI')
        status_code = record.http_headers.get_statuscode()
        
        # Not all 200 OK text/html responses are for requests for HTML.
        # For example, some sites return 200 OK with some HTML when an image isn't found.
        # This bit of logic tries to identify known image, CSS and JavaScript extensions
        # to eliminate them from consideration.
        
        uri = urlparse(url)        
        is_dependency = re.match(r'.*\.(gif|jpg|jpeg|js|png|css)$', uri.path)
        if not is_dependency and status_code == '200' and id and url:
            yield (id, {"url": url})
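To see what the extension check in `get_urls` does and does not catch, here is a small standalone sketch that applies the same pattern to a few made-up URLs. The helper name `looks_like_dependency` and the example.com URLs are illustrative assumptions, not part of the notebook.

```python
import re
from urllib.parse import urlparse

# Same extension pattern used in get_urls
DEPENDENCY = re.compile(r'.*\.(gif|jpg|jpeg|js|png|css)$')


def looks_like_dependency(url):
    """True when the URL path ends in a known image, CSS, or JavaScript extension."""
    path = urlparse(url).path  # the query string is not part of the path
    return DEPENDENCY.match(path) is not None


for url in [
    "https://example.com/",
    "https://example.com/static/site.css?v=3",
    "https://example.com/images/logo.PNG",
]:
    print(url, looks_like_dependency(url))
```

Because the match runs on the parsed path, a query string such as `?v=3` does not hide the `.css` extension. The pattern is case-sensitive, though, so `logo.PNG` is not flagged; compiling it with `re.IGNORECASE` would change that if it turns out to matter.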

Now we can process our data by selecting the WARC files we want and applying the `get_urls` function to them. Each result is a pair of a record ID and a dictionary holding either a `ua` or a `url`; the request and response for the same capture share the same ID.

In [40]:
from glob import glob

warc_files = glob('warcs/liveweb-2018*/*.warc.gz')
warcs = sc.parallelize(warc_files)
results = warcs.mapPartitions(get_urls)
results.take(1)

[('<urn:uuid:bcf103dc-ac2f-40ea-928b-9c3b5fec297f>',
  {'url': 'https://www.youtube.com/channel/UC6JnEv4XTE7kAG3B6PCXvnw/about'})]

Now we can use `combineByKey` to merge the User-Agent and URL values using the WARC-Record-ID as a key. We are also going to add two new columns for the User-Agent family and whether it is a known bot. Some JSON files that were developed as part of the UserAgents notebook can help with this. The resulting rows will look something like this:

    (
        '<urn:uuid:551471a6-631b-4ef7-99a5-f1344348ab64>',
        'https://yahoo.com',
        'Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)',
        'archive.org_bot',
        True
    )

In [70]:
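def merge(d1, d2):
    # fold the keys of d2 into d1 so each record id ends up with one dict
    d1.update(d2)
    return d1

# merge the dataset using the record-id
dataset = results.combineByKey(
    lambda d: d,  # createCombiner: start from the first dict seen for a key
    merge,        # mergeValue: add another dict within a partition
    merge         # mergeCombiners: merge dicts across partitions
)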

dataset.take(10)

[('<urn:uuid:346ee541-a0a6-484a-b117-54fe87710d57>',
  {'url': 'https://www.youtube.com/channel/UC-J-KZfRV8c13fOCkhXdLiQ/about',
   'ua': 'Wget/1.19.5 (linux-gnu)'}),
 ('<urn:uuid:caf7a37b-8e95-45a6-bab7-3d2b6dbd8923>',
  {'ua': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'}),
 ('<urn:uuid:c12629e9-ee0d-4ea5-92de-db439eeb5051>',
  {'ua': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}),
 ('<urn:uuid:1aa1c607-5b43-4d2b-9804-c801f7287f15>',
  {'ua': 'mediawords bot (http://cyber.law.harvard.edu)'}),
 ('<urn:uuid:6f43b0ec-bd4e-44ab-94ec-ba8dafbfa440>',
  {'ua': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'}),
 ('<urn:uuid:68c4923d-ef9f-448d-8882-6f523c113d3f>',
  {'ua': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; Microsoft; Lumia 535 Dual SIM)'}),
 ('<urn:uuid:d887bed1-2ef8-4f8f
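The merge step can be tried without Spark. The sketch below is a single-process stand-in for what `combineByKey` does with these dictionaries. It is not Spark's implementation, and the short record IDs and the example.com URL are made up for illustration.

```python
def merge(d1, d2):
    """Fold d2 into d1 and return it."""
    d1.update(d2)
    return d1


def combine_by_key(pairs):
    """Single-process stand-in for combineByKey: one merged dict per key."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = merge(combined[key], value)
        else:
            combined[key] = dict(value)  # copy so the inputs are not mutated
    return combined


pairs = [
    ("<urn:uuid:a>", {"ua": "Wget/1.19.5 (linux-gnu)"}),
    ("<urn:uuid:a>", {"url": "https://example.com/"}),
    ("<urn:uuid:b>", {"ua": "mediawords bot (http://cyber.law.harvard.edu)"}),
]
merged = combine_by_key(pairs)
print(merged)
```

The second key only ever receives a `ua`, which mirrors the rows in the real output that carry a User-Agent but no URL.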

In [71]:


# flatten the second cell into two different columns
# dataset = dataset.mapValues(list)

# make sure each row has a user-agent and a url (not guaranteed)
#dataset = dataset.filter(lambda r: len(r[1]) == 2)

# get our user-agent mapping dictionaries handy
import json

with open('../analysis/results/ua-families.json') as f:
    ua_families = json.load(f)
with open('../analysis/results/top-uas.json') as f:
    top_uas = json.load(f)

def unpack(r):
    id = r[0]
    url = r[1].get("url", "")
    ua = r[1].get("ua", "")
    ua_f = ua_families.get(ua, '')
    bot = top_uas.get(ua_f, False)
    return (id, url, ua, ua_f, bot)

dataset = dataset.map(unpack)

# Convert to a Spark DataFrame
df = dataset.toDF(["record_id", "url", "user_agent", "user_agent_family", "bot"])

In [72]:
df.head(10)

[Row(record_id='<urn:uuid:346ee541-a0a6-484a-b117-54fe87710d57>', url='https://www.youtube.com/channel/UC-J-KZfRV8c13fOCkhXdLiQ/about', user_agent='Wget/1.19.5 (linux-gnu)', user_agent_family='Wget', bot=True),
 Row(record_id='<urn:uuid:caf7a37b-8e95-45a6-bab7-3d2b6dbd8923>', url='', user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36', user_agent_family='Chrome', bot=False),
 Row(record_id='<urn:uuid:c12629e9-ee0d-4ea5-92de-db439eeb5051>', url='', user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36', user_agent_family='Chrome', bot=False),
 Row(record_id='<urn:uuid:1aa1c607-5b43-4d2b-9804-c801f7287f15>', url='', user_agent='mediawords bot (http://cyber.law.harvard.edu)', user_agent_family='mediawords bot', bot=False),
 Row(record_id='<urn:uuid:6f43b0ec-bd4e-44ab-94ec-ba8dafbfa440>', url='', user_agent='Mozilla/5.0 (Windows NT 6.1; WOW64; Tri

OK, let's save these results before we do any more processing.

In [73]:
df.write.csv('../analysis/results/urls')

Now let's count the URLs and see which ones have appeared more than once.

In [74]:
from pyspark.sql.functions import desc

url_counts = df.groupBy("url").count().sort(desc('count'))
url_counts.write.csv('../analysis/results/url-counts/')
url_counts.head(50)

[Row(url='', count=33070),
 Row(url='http://www.witchgif.com/', count=10),
 Row(url='https://platform.twitter.com/widgets/widget_iframe.7922da55a4ca5d4a2b1d31eedc0501e8.html?origin=https%3A%2F%2Fweb.archive.org&settingsEndpoint=%2Fsave%2Fhttps%3A%2F%2Fsyndication.twitter.com%2Fsettings', count=10),
 Row(url='https://platform.twitter.com/widgets/tweet_button.7922da55a4ca5d4a2b1d31eedc0501e8.en.html', count=6),
 Row(url='http://www.chungling.org/event/', count=6),
 Row(url='https://platform.twitter.com/widgets/widget_iframe.7922da55a4ca5d4a2b1d31eedc0501e8.html?origin=http%3A%2F%2Fweb.archive.org&settingsEndpoint=%2Fsave%2Fhttps%3A%2F%2Fsyndication.twitter.com%2Fsettings', count=6),
 Row(url='http://www.clphs.edu.my/', count=5),
 Row(url='https://www.instagram.com/2spoopy4jews.jacket.chan/', count=5),
 Row(url='https://www.instagram.com/jacket.chan/', count=4),
 Row(url='https://www.chungling.org/', count=4),
 Row(url='https://blog.cryptographyengineering.com/2013/05/14/a-few-thoughts-on
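The top row of the output above is the empty URL, which comes from records that have a User-Agent but no matching URL. A minimal sketch of counting only real URLs and keeping those seen more than once follows, using plain Python rather than Spark; the rows are hypothetical stand-ins for the flattened dataset.

```python
from collections import Counter

# Hypothetical flattened rows: (record_id, url); '' marks a record with no URL
rows = [
    ("<urn:uuid:a>", "http://example.com/"),
    ("<urn:uuid:b>", "http://example.com/"),
    ("<urn:uuid:c>", ""),
    ("<urn:uuid:d>", ""),
    ("<urn:uuid:e>", "http://example.org/page"),
]

# Skip empty URLs, then keep only URLs seen more than once
counts = Counter(url for _, url in rows if url)
repeated = [(url, n) for url, n in counts.most_common() if n > 1]
print(repeated)
```

In Spark the same idea would presumably be a filter on the `url` column before the `groupBy`, followed by a filter on `count`.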