# User Agents

This notebook examines the User-Agent strings recorded when content was requested for archiving through the Internet Archive's SavePageNow (SPN) service.

## Analyze

In [1]:
from warc_spark import init, extractor
sc, sqlc = init()

First, an extractor function that yields the User-Agent header from each WARC request record:

In [2]:
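@extractor
def ua(rec):
    # Only request records carry the client's User-Agent header.
    if rec.rec_type == 'request':
        agent = rec.http_headers.get('user-agent')
        # Skip requests that did not send the header.
        if agent:
            yield agent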

Let's create a function that will run a Spark job to get the User-Agent counts for a year as a dictionary.

In [32]:
import glob
import pandas

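def get_year(year):
    """Return a mapping of User-Agent string to request count for one year."""
    # Collect every WARC file under directories named liveweb-<year>*.
    warc_files = glob.glob('warcs/liveweb-{}*/*.warc.gz'.format(year))
    warcs = sc.parallelize(warc_files)
    # Each partition of file paths is passed to the ua extractor.
    output = warcs.mapPartitions(ua)
    return output.countByValue()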
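
To make the return value of `get_year` concrete, here is a minimal local sketch of what `countByValue` computes: a tally of how often each distinct value occurs across all partitions, returned to the driver as a dictionary. The helper `count_by_value` and the sample User-Agent strings are hypothetical stand-ins for illustration; the sketch does not use Spark.

```python
from collections import Counter


def count_by_value(partitions):
    """Local stand-in for RDD.countByValue(): tally each distinct value."""
    totals = Counter()
    for partition in partitions:
        # Spark counts each partition separately and merges the results;
        # here the merge is a running Counter update.
        totals.update(partition)
    return dict(totals)


# Hypothetical User-Agent strings, split into two partitions.
partitions = [
    ["Wget/1.19.5 (linux-gnu)", "curl/7.61.0"],
    ["Wget/1.19.5 (linux-gnu)"],
]
counts = count_by_value(partitions)
print(counts)
```

Because the whole result is collected on the driver, this approach assumes the number of distinct User-Agent strings fits comfortably in memory.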

Now we can use `get_year` for each year of data we have for SPN and create a pandas DataFrame of the results.

In [7]:
ua_data = {}
for year in range(2013, 2019):
    print(year)
    ua_data[str(year)] = get_year(year)

df = pandas.DataFrame(ua_data)
df.index.name = 'ua'

2013
2014
2015
2016
2017
2018
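
When pandas builds a DataFrame from a dictionary of dictionaries, the outer keys become columns and the union of the inner keys becomes the index. A User-Agent that is absent in a given year gets `NaN` in that column, which forces the column to a float dtype. The sketch below illustrates this with hypothetical toy counts (not the real SPN data), and shows one possible way to get integer counts back, assuming that a missing entry means zero requests.

```python
import pandas

# Hypothetical per-year counts; in the notebook these come from get_year.
toy = {
    "2017": {"Wget/1.19.5 (linux-gnu)": 3},
    "2018": {"Wget/1.19.5 (linux-gnu)": 5, "curl/7.61.0": 2},
}

toy_df = pandas.DataFrame(toy)
toy_df.index.name = "ua"

# curl was never seen in 2017, so that cell is NaN and the column is float.
print(toy_df["2017"].dtype)

# Assumption: a missing count means zero requests in that year.
filled = toy_df.fillna(0).astype(int)
print(filled)
```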


Save it for later:

In [33]:
df.to_csv('../analysis/results/ua.csv')

## Visualize

If the CSV has already been generated, we can load it here instead of waiting 6 hours for the analysis to run again.

In [30]:
df = pandas.read_csv('../analysis/results/ua.csv', index_col='ua')

How many distinct User-Agent strings are there?

In [34]:
len(df.index)

93843

Sort by the 2018 counts and show the 20 most common User-Agent strings for that year:

In [31]:
df = df.sort_values(by='2018', ascending=False)
df['2018'].head(20)

ua
Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)                                                 1487305.0
Wget/1.19.5 (linux-gnu)                                                                                                                                              719941.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36                                                  560050.0
Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot)                                                   506645.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0                                                                                       243023.0
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36             