# User Agent Activity

This notebook is some noodling around to get a sense of which hosts the various User-Agents are archiving in SavePageNow. We're going to use Spark to do it.

In [1]:
import sys
sys.path.append('../utils')

from warc_spark import init, extractor

sc, sqlc = init()

Get some SavePageNow WARCs to analyze.

In [2]:
from glob import glob

warc_files = glob('warcs/liveweb-*/*.warc.gz')[0:2]
len(warc_files)

2

Back in another notebook we determined all the User-Agent strings that are present and mapped them to User-Agent families. We also created a set of User-Agents that appear in the top-10 requestors for each year (2013-2018), along with whether they are ostensibly a bot or not. Let's load both in, since we're going to need them.

In [3]:
import json
from pprint import pprint

with open('results/ua-families.json') as f:
    ua_families = json.load(f)
with open('results/top-uas.json') as f:
    top_uas = json.load(f)

print(len(ua_families))
pprint(top_uas)

93842
{'Android': False,
 'BingPreview': False,
 'Chrome': False,
 'Chrome Mobile': False,
 'Chrome Mobile WebView': False,
 'Firefox': False,
 'Firefox Mobile': False,
 'IE': False,
 'Mobile Safari': False,
 'Mozilla': False,
 'OpenBSD ftp': False,
 'Opera': False,
 'Python Requests': False,
 'Safari': False,
 'UptimeRobot': True,
 'Wget': True,
 'archive.org_bot': True,
 'curl': True,
 'okhttp': True}


Ok, let's write an extractor function that yields a tuple of the User-Agent string and the host for each request record. Mapping each User-Agent to its family and to whether it is a bot will favor false negative identification of bots, since we only classified the top-10 User-Agents per year. But those account for the vast majority (todo: how much % wise) of the requests.

In [4]:
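@extractor
def ua_host(rec):
    """Yield a (User-Agent, host) tuple for each WARC request record."""
    # Only request records carry the client's User-Agent and Host headers.
    if rec.rec_type == 'request':
        # Fall back to an empty string when a header is missing.
        ua = rec.http_headers.get('user-agent', '')
        host = rec.http_headers.get('host', '')
        yield (ua, host)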
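
The family and bot lookup itself might look something like the sketch below. It is only a guess at the shapes involved: it assumes `ua_families` maps raw User-Agent strings to family names and `top_uas` maps family names to a bot flag. The `classify` helper, the `'Other'` fallback label, and the tiny sample dictionaries are all made up for illustration.

```python
def classify(ua, ua_families, top_uas):
    """Return (family, is_bot) for a raw User-Agent string.

    Assumes ua_families maps raw User-Agent strings to family names and
    top_uas maps family names to a bot flag; both shapes are guesses.
    """
    # 'Other' is a hypothetical label for strings we never mapped.
    family = ua_families.get(ua, 'Other')
    # Families outside the top-10 lists default to "not a bot",
    # which is where the false negatives come from.
    is_bot = top_uas.get(family, False)
    return family, is_bot


# Tiny hand-made stand-ins for the real JSON files.
ua_families = {
    'Wget/1.19.4 (linux-gnu)': 'Wget',
    'SomeCrawler/1.0': 'SomeCrawler',
}
top_uas = {'Wget': True, 'Firefox': False}

print(classify('Wget/1.19.4 (linux-gnu)', ua_families, top_uas))
print(classify('SomeCrawler/1.0', ua_families, top_uas))
print(classify('unseen agent', ua_families, top_uas))
```

Note how `SomeCrawler` comes back as not a bot simply because it never made a top-10 list, which is the false negative bias mentioned above.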

Now let's run it!

In [5]:
warcs = sc.parallelize(warc_files)
output = warcs.mapPartitions(ua_host)
df = output.toDF()

In [8]:
print(df.head(1))

[Row(_1='Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)', _2='img131.hotlinkimage.com')]
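
With rows shaped like that, one possible next step is to tally which hosts each User-Agent requested. The sketch below does it in plain Python on a few made-up rows so it runs without a cluster; the `hosts_by_ua` helper and the sample data are hypothetical. On the real data the same grouping would more likely be done in Spark, for example with `df.groupBy('_1', '_2').count()`.

```python
from collections import Counter, defaultdict


def hosts_by_ua(rows):
    """Count requests per host, grouped by User-Agent string."""
    counts = defaultdict(Counter)
    for ua, host in rows:
        counts[ua][host] += 1
    return counts


# Made-up (User-Agent, host) rows standing in for the DataFrame contents.
rows = [
    ('Wget/1.19.4', 'example.com'),
    ('Wget/1.19.4', 'example.com'),
    ('Wget/1.19.4', 'example.org'),
    ('curl/7.58.0', 'example.com'),
]

counts = hosts_by_ua(rows)
# Most requested host for one User-Agent.
print(counts['Wget/1.19.4'].most_common(1))
```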
