# Archival Novelty

This notebook explores whether a request to archive a web resource was the first such request or, more precisely, whether the request gave the Internet Archive its first knowledge of that resource. Let's call this *archival novelty*: the percentage of SavePageNow requests that brought new knowledge of a URL to the Internet Archive. We are specifically going to compare the *archival novelty* of SavePageNow requests made by automated agents (bots) with that of requests made by human agents.

The Internet Archive's [CDX API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) can tell us exactly when a given URL has been archived over time. But there are over 7 million requests for HTML pages in our dataset, which is too many to look up one at a time. So what we will do is generate two random samples, one for the bots and one for the humans, and check those against the CDX API.

First we will read the URLs CSV data that we generated in the URLs notebook back into a Spark DataFrame. Remember that this only includes SavePageNow requests that generated an HTML response, so it doesn't include things like images, JavaScript or CSS that SavePageNow could request when proxying an actual browser. Also we have added a 

In [1]:
import sys
sys.path.append('../utils')

from warc_spark import init, extractor

sc, sqlc = init()
df = sqlc.read.csv('results/urls', header=True)

## Create Samples

Now let's generate random samples of the bot and human requests, drawing 0.01% of each group.

In [35]:
count = df.count()

bots = df.filter(df.bot == True)
bots_sample = bots.sample(0.0001)
print("bots n={} p={}".format(bots_sample.count(), bots.count()))

humans = df.filter(df.bot == False)
humans_sample = humans.sample(0.0001)
print("humans n={} p={}".format(humans_sample.count(), humans.count()))

bots n=643 p=6292710
humans n=124 p=1164959
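
The sample sizes are not exactly 0.01% of each population. As far as we understand Spark's `sample` without replacement, each row is kept independently with the given probability, so the sample size itself is random. Here is a rough check in plain Python, using only the counts printed above, of how far the observed sizes sit from what that model predicts. The helper name `expected_sample_size` is made up for illustration and the Bernoulli model is an assumption about how the sampling works.

```python
import math

def expected_sample_size(population, fraction):
    """Mean and standard deviation of the sample size when each row is
    kept independently with probability `fraction` (assumed Bernoulli sampling)."""
    mean = population * fraction
    std = math.sqrt(population * fraction * (1 - fraction))
    return mean, std

# Population and observed sample counts copied from the output above.
samples = [("bots", 6292710, 643), ("humans", 1164959, 124)]

for name, population, observed in samples:
    mean, std = expected_sample_size(population, 0.0001)
    z = (observed - mean) / std
    print("{}: expected {:.1f} +/- {:.1f}, observed {}, z={:.2f}".format(
        name, mean, std, observed, z))
```

Under that assumption both observed counts fall within one standard deviation of the expected size, so the samples look unremarkable.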


## CDX API

Now let's create a function that will return all the times a URL was archived:

In [122]:
import requests

def archive_times(url):
    params = {"output": "json", "url": url, "limit": 10000, 'showResumeKey': True}
    
    while True:
        resp = requests.get('http://web.archive.org/cdx/search/cdx', params=params)
        
        if resp.status_code == 403:
            # catch "Blocked Site Error" when robots.txt prevents lookup?
            # e.g. http://www.jeuxvideo.com/forums/42-51-53683620-1-0-1-0-pour-etre-assure-au-niveau-de-la-sante-aux-usa.htm 
            yield None
            break
            
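        results = resp.json()

        # an empty response has no header row and no captures to report
        if not results:
            break

        # skip the header row; capture rows carry the timestamp in column 1,
        # shorter rows (such as the resume key) are skipped
        for result in results[1:]:
            if len(result) > 1:
                yield int(result[1])

        # a trailing single-element row is the resume key for the next page
        if len(results[-1]) == 1:
            params['resumeKey'] = results[-1][0]
        else:
            break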


print('https://mith.umd.edu', list(archive_times('https://mith.umd.edu')))

https://mith.umd.edu [20000815201749, 20001204145800, 20010201091600, 20010203124200, 20010224212144, 20010301092808, 20010302112313, 20010309133144, 20010404212235, 20010418164605, 20010515215143, 20010721201430, 20010924024921, 20011202065537, 20020601134123, 20020604040017, 20020802093507, 20020927173901, 20020929183602, 20021112230229, 20021121215012, 20021129100210, 20030129090435, 20030210123421, 20030606163556, 20030623163824, 20030729101028, 20030807092139, 20031004213724, 20031020011843, 20031028053940, 20031205193630, 20031221233101, 20040226182444, 20040406100203, 20040513075220, 20040604141330, 20040609124550, 20040721080121, 20040829181242, 20040904071755, 20040918002826, 20040918084203, 20040923032527, 20041020081020, 20041215090815, 20050203173638, 20050208082903, 20050208212844, 20050218052645, 20050305024827, 20050403074903, 20050527010839, 20050602075829, 20050907160608, 20050910021613, 20050913220944, 20050917041658, 20050923072026, 20051018095613, 20051028210510, 20
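
To make the paging logic in `archive_times` easier to follow, here is a small sketch that handles a single page of results without touching the network. It assumes the layout that the loop above relies on: a header row, one row per capture with the timestamp in the second column, and a trailing single-element row holding the resume key when more results remain. The function name `parse_cdx_page` and the sample page are made up for illustration; only the two timestamps are taken from the output above.

```python
def parse_cdx_page(rows):
    """Split one parsed CDX JSON page into (timestamps, resume_key).

    Assumes a header row first, capture rows with the timestamp in
    column 1, and an optional trailing one-element resume key row.
    """
    timestamps = [int(row[1]) for row in rows[1:] if len(row) > 1]
    resume_key = rows[-1][0] if rows and len(rows[-1]) == 1 else None
    return timestamps, resume_key

# Made-up page for illustration; not real CDX output.
page = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["edu,umd,mith)/", "20000815201749", "http://mith.umd.edu/", "text/html", "200", "AAAA", "1000"],
    ["edu,umd,mith)/", "20001204145800", "http://mith.umd.edu/", "text/html", "200", "BBBB", "1001"],
    [],
    ["example-resume-key"],
]

print(parse_cdx_page(page))
print(parse_cdx_page([]))
```

A resume key of `None` means there are no more pages to fetch, which corresponds to the `break` at the end of the loop.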

## Annotate Samples

Now we can add a new column to the sample data to indicate whether the URL was new to the Internet Archive at the time of the request. The function `new_url` takes a URL and a date and determines whether the snapshot taken at that date was the first one the Internet Archive ever recorded for that URL.

In [82]:
import re

def new_url(url, date):
    # convert 2018-10-25T21:47:18Z to 20181025214718
    request_date = int(re.sub(r'[:TZ-]', '', date))
    
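    # drop the None that archive_times yields when a lookup is blocked
    times = [t for t in archive_times(url) if t is not None]

    # with no recorded captures we can't say this request was the first
    if not times:
        return False

    # cdx results should be sorted, but we'll take the minimum to make sure
    return request_date == min(times)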

Let's test it out on the first ten rows in the humans sample:

In [83]:
for row in humans_sample.take(10):
    print(row.url, row.date, new_url(row.url, row.date))

http://8ch.net/wooo/res/5262.html 2015-10-25T10:20:34Z False
http://video.corriere.it/video-embed/84da3978-5c4d-11e4-a063-152f34c0ded7?playerType=article&autoPlay=false 2014-10-25T14:19:37Z True
http://forums.watchuseek.com/f21/62mas-reissue-sizing-benchmarks-sumo-turtle-tuna-black-bay-4444410-4.html 2017-10-25T00:21:44Z False
https://www.reddit.com/r/soccer/comments/3q4hbg/west_ham_chairmans_tweet_after_they_beat_chelsea/ 2015-10-25T09:26:37Z True
https://dawos.ru/brendy/frederique-constant/horological-smartwatch 2018-10-25T13:20:05Z False
http://8ch.net/justvideogames/res/18.html 2015-10-24T21:40:57Z False
http://www.gbi2000.ru/stenovye-bloki.html 2017-10-25T06:05:58Z False
https://www.alphapolis.co.jp/manga/987285849/366101853/episode/687506 2017-10-25T02:50:16Z True
https://quizlet.com/86339220/combo-with-social-sciences-and-history-general-clep-flash-cards/ 2018-10-25T15:50:17Z True
http://webshufu.com/fp2/ 2016-10-25T23:02:19Z False
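
The regular expression in `new_url` strips punctuation from the date without checking that the string really is a timestamp. As an alternative sketch, the conversion could go through `datetime`, which raises a `ValueError` on malformed input. The function name `to_wayback_timestamp` is made up for illustration, and it assumes every date in the data has the `2018-10-25T21:47:18Z` shape seen above.

```python
from datetime import datetime

def to_wayback_timestamp(date):
    """Convert a UTC string such as 2018-10-25T21:47:18Z into the
    14-digit integer form that the CDX timestamps use."""
    parsed = datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.strftime("%Y%m%d%H%M%S"))

print(to_wayback_timestamp("2018-10-25T21:47:18Z"))  # 20181025214718
```

For well-formed dates this gives the same integer as the regular expression, so either version can be compared directly against the values returned by `archive_times`.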


Now we can create our new column, which is easier to do in pandas. Depending on the number of rows in the sample this can take some time, since each row triggers an API lookup, which takes about a second.

In [117]:
import pandas

humans_df = humans_sample.toPandas()

humans_df['new_url'] = pandas.Series(
    [new_url(r.url, r.date) for i, r in humans_df.iterrows()], 
    index=humans_df.index
)

In [118]:
humans_df.head(5)

Unnamed: 0,record_id,warc_file,date,url,user_agent,user_agent_family,bot,new_url
0,<urn:uuid:98afdd6c-06d6-4a23-8361-174b9e251f30>,warcs/liveweb-20151025100340/live-201510250940...,2015-10-25T10:20:34Z,http://8ch.net/wooo/res/5262.html,Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKi...,Chrome,False,False
1,<urn:uuid:c2bb6225-f9a8-4561-babb-99ae69251886>,warcs/liveweb-20141025150505/live-201410251409...,2014-10-25T14:19:37Z,http://video.corriere.it/video-embed/84da3978-...,Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20...,Firefox,False,True
2,<urn:uuid:fce71614-83f5-416b-a5c4-82c534533eda>,warcs/liveweb-20171025004543/live-201710242352...,2017-10-25T00:21:44Z,http://forums.watchuseek.com/f21/62mas-reissue...,Mozilla/5.0 (Windows NT 3.0; rv:10.0) Gecko/20...,Firefox,False,False
3,<urn:uuid:cde4afe5-d8a2-4b20-a158-8a540365ba53>,warcs/liveweb-20151025060156/live-201510250536...,2015-10-25T09:26:37Z,https://www.reddit.com/r/soccer/comments/3q4hb...,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) G...,Firefox,False,True
4,<urn:uuid:5cb869f9-5402-4b2c-8add-4c0767f9a34d>,warcs/liveweb-20181025131830/live-201810251311...,2018-10-25T13:20:05Z,https://dawos.ru/brendy/frederique-constant/ho...,Mozilla/5.0 (Windows NT 6.0; WOW64; rv:39.0) G...,Firefox,False,False


In [92]:
bots_df = bots_sample.toPandas()

bots_df['new_url'] = pandas.Series(
    [new_url(r.url, r.date) for i, r in bots_df.iterrows()], 
    index=bots_df.index
)

In [116]:
bots_df.head(5)

Unnamed: 0,record_id,warc_file,date,url,user_agent,user_agent_family,bot,new_url
0,<urn:uuid:92c008dc-88a7-447d-8371-9d342279facd>,warcs/liveweb-20171025181342/live-201710251823...,2017-10-25T18:41:12Z,https://www.zoominfo.com/p/Megan-Tudehope/-179...,okhttp/3.8.1,okhttp,True,True
1,<urn:uuid:91ef7360-75ff-4030-b116-58ae9dace43a>,warcs/liveweb-20171025234843/live-201710252337...,2017-10-25T23:56:29Z,https://www.zoominfo.com/p/John-Umbach/-186608...,okhttp/3.8.1,okhttp,True,True
2,<urn:uuid:5061729f-4cd8-4af4-b47e-c87422ce7d79>,warcs/liveweb-20171025015630/live-201710250132...,2017-10-25T01:47:10Z,https://www.zoominfo.com/p/Puay-Ong/-1621467602,okhttp/3.8.1,okhttp,True,True
3,<urn:uuid:317277fe-beb3-4f24-b094-3c894702388a>,warcs/liveweb-20171025050806/live-201710250429...,2017-10-25T04:49:30Z,https://www.zoominfo.com/p/Haisam-Sohail/-1628...,okhttp/3.8.1,okhttp,True,True
4,<urn:uuid:7f75c993-6949-4b60-8783-8b0ac61ead99>,warcs/liveweb-20171025200831/cachegw-201710250...,2017-10-25T10:19:06Z,http://www.aarvanss.in/robots.txt,Mozilla/5.0 (compatible; archive.org_bot; Wayb...,archive.org_bot,True,False


## Archival Novelty

Now that we've added the `new_url` column we can calculate the *archival novelty* for each sample as the percentage of SavePageNow requests that brought brand new URLs to the Internet Archive.

In [119]:
bots_novelty = len(bots_df.query("new_url == True")) / len(bots_df) * 100
print("bots: %{0:.2f}".format(bots_novelty))

humans_novelty = len(humans_df.query('new_url == True')) / len(humans_df) * 100
print("humans: %{0:.2f}".format(humans_novelty))

bots: %71.85
humans: %52.42
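
These percentages come from fairly small samples, so it may help to attach a rough margin of error to each. Here is a minimal sketch, assuming a normal approximation to the binomial at the 95% level. It also assumes that the counts of new URLs were 462 of 643 for bots and 65 of 124 for humans, which is what the printed percentages and the earlier sample sizes imply. The function name `margin_of_error` is chosen here for illustration.

```python
import math

def margin_of_error(successes, n, z=1.96):
    """Half-width, in percentage points, of an approximate 95% confidence
    interval for a proportion (normal approximation to the binomial)."""
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n) * 100

# Counts back-calculated from the printed percentages and sample sizes (assumed).
for name, successes, n in [("bots", 462, 643), ("humans", 65, 124)]:
    novelty = successes / n * 100
    print("{}: {:.2f}% +/- {:.2f} points".format(
        name, novelty, margin_of_error(successes, n)))
```

Under these assumptions the margins come out at roughly 3.5 points for bots and 8.8 points for humans. The humans estimate is less precise because its sample is about a fifth of the size.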
