# Liveliness

It would be useful to know which archived URLs are still available on the web and which are not. Determining whether something is still available is surprisingly fraught, because some web servers respond with 200 OK instead of 404 Not Found for things that are no longer available. So it's possible that we will get false positives in these results (more URLs counted as alive than really are). But that's ok because we'll get a generous measure of liveliness, which means we won't over-

We do have a lot of URLs, so it's probably easier if we sample again. It would also be useful to be able to sample by year. Let's start by getting our URLs dataset into Spark again.

In [1]:
import sys
sys.path.append('../utils')

from warc_spark import init

sc, sqlc = init()
urls = sqlc.read.csv('results/urls', header=True)

Start with 2018:

In [3]:
year = urls.filter(urls.date.startswith('2018-'))

Create our sample, with a margin of error of 5% and a confidence level of 95%.

In [4]:
pop_size = year.count()
sample_size = int(pop_size / (1 + pop_size * (.05**2)))
sample_list = year.rdd.takeSample(withReplacement=False, num=sample_size, seed=21)
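
The expression used for `sample_size` is the finite-population formula n = N / (1 + N * e^2), where N is the population size and e is the margin of error (it is often attributed to Yamane or Slovin). As a rough sketch, the helper below wraps the same arithmetic; the name `sample_size` as a function and the population sizes in the loop are illustrative assumptions, not values from this dataset.

```python
def sample_size(pop_size, margin=0.05):
    """Finite-population sample size, n = N / (1 + N * e^2), rounded down."""
    return int(pop_size / (1 + pop_size * margin ** 2))


# Hypothetical population sizes, for illustration only.
for pop in (1_000, 10_000, 100_000, 1_000_000):
    print(pop, sample_size(pop))
```

With a 5% margin the result can never exceed 1 / 0.05^2 = 400, so for large populations the sample levels off just under 400 URLs no matter how many were archived that year.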

Put the sample into a Pandas DataFrame:

In [5]:
import pandas

sample = pandas.DataFrame(sample_list, columns=year.schema.fieldNames())
sample = sample.set_index('record_id')
sample.head()

| record_id | warc_file | date | url | user_agent | user_agent_family | bot |
|---|---|---|---|---|---|---|
| <urn:uuid:584496d7-5296-4395-9362-7f7aba03ece2> | warcs/liveweb-20181025034513/live-201810250335... | 2018-10-25T03:39:24Z | https://www.youtube.com/channel/UCgefQJC5UgbWJ... | Wget/1.19.5 (linux-gnu) | Wget | True |
| <urn:uuid:8c658003-6728-4cfa-a178-6d062fc0e5b0> | warcs/liveweb-20181025014552/live-201810250114... | 2018-10-25T01:15:18Z | http://km.aifb.kit.edu/projects/numbers/web/n2... | Wget/1.19.4 (darwin17.3.0) | Wget | True |
| <urn:uuid:b4450919-a293-48c4-a163-31818f3ddb4a> | warcs/liveweb-20181025070807/live-201810250655... | 2018-10-25T06:58:26Z | https://www.wykop.pl/wpis/36150773/mow-do-mnie... | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6... | Firefox | False |
| <urn:uuid:4f5a9ec0-4056-4336-99c9-de811dadf077> | warcs/liveweb-20181025070807/live-201810250647... | 2018-10-25T07:13:52Z | https://www.instagram.com/mufmunkedal/ | python-requests/2.19.1 | Python Requests | True |
| <urn:uuid:2d08611a-8639-4dcb-8891-445454493535> | warcs/liveweb-20181025111513/live-201810251107... | 2018-10-25T11:12:37Z | https://www.wykop.pl/wpis/36154923/wrocil-i-od... | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6... | Firefox | False |


Now let's create a function that checks if a URL is still available.

In [6]:
import requests

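def alive(url):
    """Return True if a GET request for url ends in a 200 OK response."""
    result = False
    try:
        # requests follows redirects, so this is the status of the final response
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            result = True
    except Exception:
        # DNS failures, timeouts, and connection errors all count as not alive
        pass
    return result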

for url in ['https://nytimes.com', 'https://nytimes.com/no-way-forgetaboutit', 'https://this.is.not.a.host.name']:
    print(url, alive(url))

https://nytimes.com True
https://nytimes.com/no-way-forgetaboutit False
https://this.is.not.a.host.name False
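
The check above cannot tell a real page from a "soft 404", where a server answers 200 OK for something that is gone. One common heuristic, sketched here as an assumption rather than something this notebook does, is to ask the same host for a random path that should not exist and see whether it still reports success. The helper names `probe_url` and `serves_soft_404s` are hypothetical.

```python
import uuid
from urllib.parse import urlsplit, urlunsplit


def probe_url(url):
    """Return a URL on the same host whose random path should not exist."""
    parts = urlsplit(url)
    random_path = '/' + uuid.uuid4().hex
    # Drop the original path, query, and fragment; keep scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, random_path, '', ''))


def serves_soft_404s(url, check):
    """True if `check` reports a path that should be missing as available.

    `check` is any callable that takes a URL and returns True or False,
    such as the alive() function defined above.
    """
    return check(probe_url(url))
```

Calling `serves_soft_404s(url, alive)` costs one extra request per URL. A True result only suggests that a 200 from that host is weak evidence; it does not prove the original URL is dead.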


Now let's add a new column to our data recording whether each URL is still on the web.

In [7]:
sample['alive'] = sample.url.map(alive)
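
Mapping `alive` over the sample checks one URL at a time, and each request can wait up to the 10-second timeout. If that is too slow, a thread pool is one way to run the checks concurrently. This is a minimal sketch: the helper name `check_all` and the default of 20 workers are assumptions, not part of the notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check, max_workers=20):
    """Apply `check` to every URL using a pool of threads, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        return list(pool.map(check, urls))
```

It could then be used as `sample['alive'] = check_all(sample.url, alive)`, which should give the same column as the sequential version, assuming the servers respond the same way under concurrent requests.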

In [8]:
print('n={}'.format(len(sample)))
print('dead={}'.format(len(sample.query('alive == False'))))
print('alive={}'.format(len(sample.query('alive == True'))))


n=399
dead=15
alive=384


Let's bundle it up in a function so we can run it for different years.

In [9]:
def liveliness(y):
    year = urls.filter(urls.date.startswith(str(y) + '-'))
    pop_size = year.count()
    sample_size = int(pop_size / (1 + pop_size * (.05**2)))
    sample_list = year.rdd.takeSample(withReplacement=False, num=sample_size, seed=21)
    sample = pandas.DataFrame(sample_list, columns=year.schema.fieldNames())
    sample = sample.set_index('record_id')
    sample['alive'] = sample.url.map(alive)
    return sample

First make sure it returns the same thing that we calculated previously:

In [10]:
df = liveliness(2018)

In [12]:
def print_stats(df):
    print('n={}'.format(len(df)))
    print('dead={}'.format(len(df.query('alive == False'))))
    print('alive={}'.format(len(df.query('alive == True'))))

print_stats(df)

n=399
dead=15
alive=384


They match! So now we can easily generate some stats for 2017.

In [13]:
df = liveliness(2017)
print_stats(df)

n=399
dead=350
alive=49


In [14]:
df = liveliness(2016)
print_stats(df)

n=399
dead=97
alive=302
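
To compare years it helps to turn the counts into proportions with a margin of error. The sketch below uses the counts printed above and the usual normal approximation for a proportion at a 95% confidence level (z = 1.96). The helper name `alive_share` is an assumption, and the approximation is rough when the proportion is close to 0 or 1.

```python
import math

# (alive, n) counts copied from the outputs above.
counts = {2018: (384, 399), 2017: (49, 399), 2016: (302, 399)}


def alive_share(n_alive, n, z=1.96):
    """Return (proportion alive, margin of error) via the normal approximation."""
    p = n_alive / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


for year, (n_alive, n) in sorted(counts.items()):
    p, margin = alive_share(n_alive, n)
    print('{}: {:.1%} alive (+/- {:.1%})'.format(year, p, margin))
```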
