# Archival Novelty

This notebook explores whether a request to archive a web resource was the first such request or, more precisely, whether the request gave the Internet Archive its first knowledge of that resource. Let's call this *archival novelty*: the percentage of SavePageNow requests that brought new knowledge of a URL to the Internet Archive. We are specifically going to look at *archival novelty* in terms of SavePageNow requests from automated and human agents.

The Internet Archive's [CDX API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) can tell us exactly when a given URL has been archived over time. But there are 7 million requests for HTML pages in our dataset, so we will randomly sample the requests and check only those against the CDX API.

First we will read the URLs CSV data that we generated in the URLs notebook back into a Spark DataFrame. Remember that this contains only SavePageNow requests that resulted in an HTML response, so it doesn't include things like images, JavaScript or CSS that SavePageNow could request when proxying an actual browser.

In [1]:
import sys
sys.path.append('../utils')

from warc_spark import init, extractor

sc, sqlc = init()
df = sqlc.read.csv('results/urls', header=True)

## Create Samples

Let's look specifically at 2018. Since millions of SavePageNow requests were received, and it's not feasible to query the CDX API millions of times, we will generate a random sample. Since we know the size of the population (number of requests) we will use the [Yamane Method](http://www.research-system.siam.edu/images/independent/Consumer_acceptance_of_air_purifier_products_in_China/CHAPTER_3.pdf) to calculate the sample size needed for a margin of error of 5% at a confidence level of 95%.
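
As a rough illustration of how the Yamane formula, n = N / (1 + N·e²), behaves, here is a minimal sketch. The function name `yamane_sample_size` and the population sizes are made up for this example; only the formula itself and the 5% margin come from the text above.

```python
def yamane_sample_size(population, margin_of_error=0.05):
    """Yamane's simplified formula: n = N / (1 + N * e^2), rounded down."""
    return int(population / (1 + population * margin_of_error ** 2))


# Hypothetical population sizes, chosen only to show how n levels off
for population in (1_000, 100_000, 7_000_000):
    print(population, yamane_sample_size(population))
```

For large populations the result approaches 1 / e², which is 400 for a 5% margin. That is why a population in the millions still yields a sample of only 399.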

In [8]:
import pandas

year = df.filter(df.date.like('2018%'))
pop_size = year.count()
sample_size = int(pop_size / (1 + pop_size * (.05**2)))
sample = year.rdd.takeSample(withReplacement=False, num=sample_size, seed=42)

print("sample size:", len(sample))

sample size: 399


In [10]:
sample = pandas.DataFrame(sample, columns=year.schema.fieldNames())
sample.head()

Unnamed: 0,record_id,warc_file,date,url,user_agent,user_agent_family,bot
0,<urn:uuid:e70603a2-096a-4f82-b66d-7bc2bb5f286e>,warcs/liveweb-20181025043139/live-201810250352...,2018-10-25T04:12:23Z,http://eestipaevaleht.se/,Python-urllib/2.7,Python-urllib,True
1,<urn:uuid:8d3949bd-c7c4-45a1-b11f-90a0975de679>,warcs/liveweb-20181025212424/live-201810252105...,2018-10-25T21:23:14Z,http://km.aifb.kit.edu/projects/numbers/web/n2...,Wget/1.19.4 (darwin17.3.0),Wget,True
2,<urn:uuid:87c47520-05d9-47ac-b8b5-79408549caff>,warcs/liveweb-20181025170957/live-201810251658...,2018-10-25T17:05:13Z,https://socialblade.com/youtube/channel/UCCmXc...,Wget/1.19.5 (linux-gnu),Wget,True
3,<urn:uuid:0b7d69ca-d38c-4d6c-9f91-bcb63c82265c>,warcs/liveweb-20181025000937/cachegw-201810240...,2018-10-24T13:44:56Z,https://www.reg.ru/domain/shop/lot/rockderzhav...,Mozilla/5.0 (compatible; archive.org_bot; Wayb...,archive.org_bot,True
4,<urn:uuid:572b5006-3a51-43df-94fc-95ccfea727af>,warcs/liveweb-20181025165702/live-201810251631...,2018-10-25T16:47:20Z,http://abelhas.pt/action/LastAccounts/LastSeen...,Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63...,Firefox,False


## Add Novelty

Now let's create a function that will return all the times a URL was archived. If supplied, the `to` parameter limits the results to snapshots that were taken at or before that time. This reduces the number of results we need to collect, since we don't need to know how many times a URL was collected *after* the time in question.

In [11]:
import requests

def archive_times(url, to=None):
    params = {"output": "json", "url": url, "limit": 10000, 'showResumeKey': True}
    if to:
        params['to'] = to
    
    while True:
        resp = requests.get('http://web.archive.org/cdx/search/cdx', params=params)
        
        if resp.status_code == 403:
            # catch "Blocked Site Error" when robots.txt prevents lookup?
            # e.g. http://www.jeuxvideo.com/forums/42-51-53683620-1-0-1-0-pour-etre-assure-au-niveau-de-la-sante-aux-usa.htm 
            # stop without yielding, so callers only ever receive integer timestamps
            return
            
        results = resp.json()
        # ignore header
        for result in results[1:]:
            if len(result) > 1:
                yield int(result[1])
        
        if len(results) != 0 and len(results[-1]) == 1:
            params['resumeKey'] = results[-1][0]
        else:
            break


print('https://mith.umd.edu', list(archive_times('https://mith.umd.edu/', to=20020929183602)))

https://mith.umd.edu [20000815201749, 20001204145800, 20010201091600, 20010203124200, 20010224212144, 20010301092808, 20010302112313, 20010309133144, 20010404212235, 20010418164605, 20010515215143, 20010721201430, 20010924024921, 20011202065537, 20020601134123, 20020604040017, 20020802093507, 20020927173901, 20020929183602]
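
The pagination in `archive_times` relies on the shape of a CDX JSON page as the code above assumes it: a header row, one row per capture, and, when more results remain, a trailing single-element row holding the resume key. The sketch below isolates that parsing step so it can be checked without a network call. `parse_cdx_page` is a hypothetical helper, and the sample page is hand-written with placeholder values; only the two timestamps are taken from the output above.

```python
def parse_cdx_page(results):
    """Split one page of CDX JSON into (timestamps, resume_key)."""
    # skip the header row; rows with fewer than two fields carry no timestamp
    timestamps = [int(row[1]) for row in results[1:] if len(row) > 1]

    # a trailing single-element row is treated as the resume key
    resume_key = None
    if results and len(results[-1]) == 1:
        resume_key = results[-1][0]
    return timestamps, resume_key


# Hand-written page for illustration; digest, length and resume key are placeholders
page = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["edu,umd,mith)/", "20000815201749", "http://mith.umd.edu/", "text/html", "200", "PLACEHOLDER", "0"],
    ["edu,umd,mith)/", "20001204145800", "http://mith.umd.edu/", "text/html", "200", "PLACEHOLDER", "0"],
    [],
    ["placeholder-resume-key"],
]

timestamps, resume_key = parse_cdx_page(page)
print(timestamps, resume_key)
```

When the last row is an ordinary capture row, `resume_key` stays `None`, which corresponds to the `break` that ends the loop in `archive_times`.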


Now that we have a function that returns all the times that a URL was archived we can create another function `new_url` that takes a URL and a date and determines whether the snapshot taken at that time was the first time the URL was seen.
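
The check has two parts: turning the ISO 8601 request date into the 14-digit timestamp form that the CDX API returns, and comparing it with the earliest known capture. Here is a minimal sketch of both steps using only the standard library. The helper names `to_wayback_timestamp` and `is_first_capture` are assumptions made for this illustration and are not used elsewhere in the notebook.

```python
from datetime import datetime


def to_wayback_timestamp(date):
    """Convert e.g. 2018-10-25T21:47:18Z to the integer 20181025214718."""
    parsed = datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.strftime("%Y%m%d%H%M%S"))


def is_first_capture(request_date, capture_times):
    """True only when the earliest known capture is the request itself."""
    return bool(capture_times) and min(capture_times) == request_date


request_date = to_wayback_timestamp("2018-10-25T21:47:18Z")
print(is_first_capture(request_date, [request_date]))
print(is_first_capture(request_date, [20170101000000, request_date]))
```

Parsing with `strptime` rejects malformed dates instead of silently passing them through, which is the main difference from stripping punctuation with a regular expression.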

In [12]:
import re

def new_url(url, date):
    # convert 2018-10-25T21:47:18Z to 20181025214718
    request_date = int(re.sub(r'[:TZ-]', '', date))
    
    # cdx results should be sorted, but we'll make sure
    times = sorted(archive_times(url, to=request_date))
    if len(times) > 0:
        return request_date == times[0]
    else:
        # sometimes it seems Wayback CDX doesn't know about things in the WARCs?
        # e.g. http://josephinedark.net/code.php?PHPSESSID=96bb60faa7f178b21b949e9359129459
        return False

Let's test it out on the first five rows in the sample:

In [13]:
for i, row in sample.head(5).iterrows():
    print(row.url, new_url(row.url, row.date))

http://eestipaevaleht.se/ False
http://km.aifb.kit.edu/projects/numbers/web/n2481513 True
https://socialblade.com/youtube/channel/UCCmXcYtA4T9wxc3vmbGFxRA False
https://www.reg.ru/domain/shop/lot/rockderzhava.ru?rid=2014 False
http://abelhas.pt/action/LastAccounts/LastSeenRotation?TimeStamp=1540485933469&itemsCount=84&inRow=7&pageSize=21&page=4 True


You can go over to [Internet Archive's Wayback Machine](https://web.archive.org) to confirm the results. Now we can create our new column named `new_url` to indicate whether the URL was new when it was added to the Internet Archive. Depending on the number of rows in the sample this can take some time, since each row triggers a CDX API lookup, which will take about a second.

In [14]:
sample['new_url'] = pandas.Series(
    [new_url(r.url, r.date) for i, r in sample.iterrows()], 
    index=sample.index
)

In [16]:
sample.head(5)

Unnamed: 0,record_id,warc_file,date,url,user_agent,user_agent_family,bot,new_url
0,<urn:uuid:e70603a2-096a-4f82-b66d-7bc2bb5f286e>,warcs/liveweb-20181025043139/live-201810250352...,2018-10-25T04:12:23Z,http://eestipaevaleht.se/,Python-urllib/2.7,Python-urllib,True,False
1,<urn:uuid:8d3949bd-c7c4-45a1-b11f-90a0975de679>,warcs/liveweb-20181025212424/live-201810252105...,2018-10-25T21:23:14Z,http://km.aifb.kit.edu/projects/numbers/web/n2...,Wget/1.19.4 (darwin17.3.0),Wget,True,True
2,<urn:uuid:87c47520-05d9-47ac-b8b5-79408549caff>,warcs/liveweb-20181025170957/live-201810251658...,2018-10-25T17:05:13Z,https://socialblade.com/youtube/channel/UCCmXc...,Wget/1.19.5 (linux-gnu),Wget,True,False
3,<urn:uuid:0b7d69ca-d38c-4d6c-9f91-bcb63c82265c>,warcs/liveweb-20181025000937/cachegw-201810240...,2018-10-24T13:44:56Z,https://www.reg.ru/domain/shop/lot/rockderzhav...,Mozilla/5.0 (compatible; archive.org_bot; Wayb...,archive.org_bot,True,False
4,<urn:uuid:572b5006-3a51-43df-94fc-95ccfea727af>,warcs/liveweb-20181025165702/live-201810251631...,2018-10-25T16:47:20Z,http://abelhas.pt/action/LastAccounts/LastSeen...,Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63...,Firefox,False,True


Now we can calculate the novelty for 2018, or the probability that a SavePageNow request brought a new URL to the archive.

In [17]:
novelty = len(sample.query('new_url == True')) / len(sample)
print(novelty)

0.37343358395989973


Now we can say: on October 25, 2018, 37% of SavePageNow requests added a new (novel) URL to the archive. The margin of sampling error is 5% at a 95% level of confidence.
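
As a sanity check on the stated margin, the usual normal-approximation formula for a proportion, z·√(p(1−p)/n), can be evaluated for this sample. This is a sketch under simplifying assumptions: it ignores any finite-population correction, and `margin_of_error` is a name introduced here for illustration.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion (z=1.96 for 95%)."""
    return z * math.sqrt(p * (1 - p) / n)


n = 399      # sample size reported above
p = 0.3734   # observed novelty, rounded from the output above

observed = margin_of_error(p, n)
worst_case = margin_of_error(0.5, n)  # p = 0.5 maximizes the margin
print(round(observed, 4), round(worst_case, 4))
```

Both values come out a little under 0.05, which is consistent with the 5% margin the sample size was designed for.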

Let's save our sampled dataset since it did take some time to generate.

In [18]:
sample.to_csv('results/novelty-sample.csv')

## Archival Novelty

Now that we've added the `new_url` column we can calculate the *archival novelty* for each sample as the percentage of SavePageNow requests that brought brand new URLs to the Internet Archive.

In [None]:
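# assumption: bots_df and humans_df are the bot and human rows of `sample`
# (`bot` was read from CSV, so compare it as a string)
is_bot = sample['bot'].astype(str) == 'True'
bots_df, humans_df = sample[is_bot], sample[~is_bot]

bots_novelty = len(bots_df.query("new_url == True")) / len(bots_df) * 100
print("bots: {0:.2f}%".format(bots_novelty))

humans_novelty = len(humans_df.query('new_url == True')) / len(humans_df) * 100
print("humans: {0:.2f}%".format(humans_novelty))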

In [None]:
bots_df.to_csv('results/bots-sample.csv')
humans_df.to_csv('results/humans-sample.csv')

It would now be interesting to look at the percentages for each year in the dataset and visualize them as a bar chart. To collect the data by year we'll create a function that bundles up all the operations we just did so we can call it for each year.

In [None]:
def novelty(year, bots):
    pop = df.filter(df.date.like(year + '%')).filter(df.bot == bots)
    pop_count = pop.count()
    
    if pop_count == 0:
        return 0.0, 0, 0
    
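    # draw a random sample: each row is kept with probability 0.0001
    sample = pop.sample(0.0001)
    sample_df = sample.toPandas()

    # an empty sample would cause a division by zero below
    if len(sample_df) == 0:
        return 0.0, 0, pop_count
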
    sample_df['new_url'] = pandas.Series(
        [new_url(r.url, r.date) for i, r in sample_df.iterrows()], 
        index=sample_df.index
    )
    
    novelty = len(sample_df.query("new_url == True")) / len(sample_df)
    return novelty, len(sample_df), pop_count

In [None]:
novelty_by_year = {'novelty': [], 'year': [], 'agent': [], 'p': [], 'n': []}

for year in range(2013, 2019):
    y = str(year)
    for bots in [True, False]:
        nov, n, p = novelty(y, bots=bots)
        print("year={} novelty={} n={} p={}".format(y, nov, n, p))
        novelty_by_year['novelty'].append(nov)
        novelty_by_year['year'].append(y)
        novelty_by_year['agent'].append('bots' if bots else 'humans')   
        novelty_by_year['p'].append(p)
        novelty_by_year['n'].append(n)

In [None]:
import altair
altair.renderers.enable('notebook')

novelty_df = pandas.DataFrame(novelty_by_year)

chart = altair.Chart(novelty_df).mark_bar().encode(
    # the DataFrame column holding 'bots' / 'humans' is named `agent`
    altair.X('agent:N', title=''),
    altair.Y('novelty:Q', axis=altair.Axis(format='%')),
    altair.Color('agent:N'),
    altair.Column('year:O', title=''),
)

chart = chart.properties(
    width=50,
    title='Archival Novelty by Year'
)

chart