<a href="https://colab.research.google.com/github/edsu/iipc-covid19/blob/master/Seeds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# COVID-19 Seeds

This notebook explores the seeds that are being crawled in the [Novel Coronavirus COVID-19](https://archive-it.org/collections/13529/) Archive-It collection. It uses the [Archive-It Partner API](https://support.archive-it.org/hc/en-us/articles/360032747311-Access-your-account-with-the-Archive-It-Partner-API) which does not seem to require a key for public collections (yay). More context for this collecting effort can be found in [this IIPC blog post](https://blog.archive.org/2020/02/13/archiving-information-on-the-novel-coronavirus-covid-19/).

## 0. Import

First let's import some things that we are going to need later. It's useful to do them all here at the beginning in case you want to skip parts of the data collection and use the data that is already here.

In [298]:
import csv
import altair
import pandas
import wayback
import datetime
import requests

altair.renderers.enable('html')

RendererRegistry.enable('html')

If you happen to be running this notebook in Colab, you won't have the data subdirectory that's part of the GitHub repository. But we can go and get it. You can run this cell whether you have the data or not.

In [300]:
import os
import shutil

if not os.path.isdir('data'):
    os.system('git clone https://github.com/edsu/iipc-covid19.git')
    os.rename('iipc-covid19/data', 'data')
    shutil.rmtree('iipc-covid19')

## 1. Get the Seeds

First let's download the seeds in the collection and save them as a CSV. If you want to use the CSV that's already here you can move on to **Section 2**. We're going to write out the data to a file called `seeds.csv`. You can see the type of data that is returned by looking at [this API response](https://partner.archive-it.org/api/seed?collection=13529&limit=100). The Archive-It Partner API has a route for returning seeds for a given collection, which is indicated with the `collection` parameter. We can use the `limit` and `offset` parameters to walk through the results page by page without getting all of them at once.

In [5]:
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "limit": 100
}

Now we can create a loop that keeps fetching results and incrementing the offset until there are no more seeds. We could have used the CSV output, but it is useful to normalize some of the structured metadata. This will likely take a few minutes to run.
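
To make the normalization concrete: judging from how the cell below indexes it, each seed's `metadata` maps a field name to a list of entries that each carry a `value`. A minimal sketch with a made-up seed record (the sample values are hypothetical, not real API output) shows what taking the first value does:

```python
# Hypothetical seed record shaped the way the cell below indexes it;
# the values are made up for illustration, not copied from the API.
sample_seed = {
    "id": 1,
    "url": "http://example.com/",
    "metadata": {
        "Title": [{"value": "Example site"}],
        "Language": [{"value": "English"}, {"value": "French"}],
    },
}

def first_val(meta, name):
    # Return the first value recorded for a metadata field, or None if absent
    return meta[name][0]["value"] if name in meta else None

meta = sample_seed["metadata"]
title = first_val(meta, "Title")              # single entry
language = first_val(meta, "Language")        # only the first entry is kept
description = first_val(meta, "Description")  # field missing, so None
print(title, language, description)
```

Fields with several values lose everything after the first, which keeps the CSV flat at the cost of some detail.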

In [9]:
def first_val(meta, name):
    # Return the first value of a metadata field, or None if the field is absent
    return meta[name][0]["value"] if name in meta else None

params['offset'] = 0

# Open the file in a with block so the rows are flushed and the file is closed
with open('data/seeds.csv', 'w', newline='', encoding='utf-8') as seeds_file:
    out = csv.writer(seeds_file)
    out.writerow([
        "id",
        "url",
        "creator",
        "created",
        "updated",
        "crawl_definition",
        "title",
        "description",
        "language",
        "tld"
    ])

    while True:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        seeds = resp.json()
        if len(seeds) == 0:
            break

        for seed in seeds:
            meta = seed["metadata"]
            out.writerow([
                seed["id"],
                seed["url"],
                seed["created_by"],
                seed["created_date"],
                seed["last_updated_date"],
                seed["crawl_definition"],
                first_val(meta, "Title"),
                first_val(meta, "Description"),
                first_val(meta, "Language"),
                first_val(meta, "Top-Level Domain")
            ])

        # Move on to the next page of results
        params['offset'] += params['limit']

So now you should hopefully see an updated `seeds.csv`!
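
The offset paging used above can also be sketched as a small generator, which separates the paging logic from what is done with each seed. Here `paginate` and `fake_fetch` are hypothetical names: `fake_fetch` stands in for the real request to the Partner API so that the sketch runs without network access.

```python
def paginate(fetch_page, limit=100):
    """Yield items from an offset-paged source until an empty page comes back."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=limit)
        if not page:
            break
        yield from page
        offset += limit

# Stand-in for the API: 250 fake seed records served in slices.
FAKE_SEEDS = [{"id": i} for i in range(250)]

def fake_fetch(offset, limit):
    return FAKE_SEEDS[offset:offset + limit]

seed_ids = [seed["id"] for seed in paginate(fake_fetch, limit=100)]
print(len(seed_ids))
```

With the real API, `fetch_page` would wrap the request, for example `lambda offset, limit: requests.get(url, params={**params, "offset": offset, "limit": limit}).json()`.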

## 2. Display the Seeds

First let's load our `seeds.csv` into a Pandas DataFrame where we can more easily manipulate it.

In [10]:
seeds = pandas.read_csv('data/seeds.csv', parse_dates=["created", "updated"])
seeds.head()

Unnamed: 0,id,url,creator,created,updated,crawl_definition,title,description,language,tld
0,2147692,http://coronavirus.fr/,alext,2020-02-21 03:43:18.662353+00:00,2020-03-16 19:53:45.860949+00:00,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr
1,2147693,http://english.whiov.cas.cn/,alext,2020-02-21 03:43:18.706571+00:00,2020-03-16 19:52:28.575749+00:00,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn
2,2147694,http://www.china-embassy.or.jp/chn/,alext,2020-02-21 03:43:18.739126+00:00,2020-03-16 19:53:03.086729+00:00,31104294373,中华人民共和国驻日本大使馆,Embassy,Chinese,.jp
3,2147695,http://www.china-embassy.or.jp/jpn/,alext,2020-02-21 03:43:18.766308+00:00,2020-03-16 19:54:02.280945+00:00,31104294373,中華人民共和国駐日本国大使館,Embassy,Japanese,.jp
4,2147696,https://cadenaser.com/tag/ncov/a/,alext,2020-02-21 03:43:18.791716+00:00,2020-03-16 19:54:19.694418+00:00,31104294373,Coronavirus de Wuhan,Cadena Ser,Spanish,.com


We can sort them by the last update time in descending order.

In [8]:
seeds = seeds.sort_values('updated', ascending=False)
seeds.head(10)

Unnamed: 0,id,url,creator,created,updated,crawl_definition,title,description,language,tld
351,2160881,https://www.dgs.pt/corona-virus,alext,2020-03-14 21:30:09.708098+00:00,2020-03-20 16:46:24.938117+00:00,31104294373,Direção Geral da Saúde - Coronavírus,Portuguese Authority for Health,Portuguese,.pt
1219,2159647,https://treasury.gov.au/sites/default/files/20...,alext,2020-03-12 14:51:58.674042+00:00,2020-03-20 16:46:18.641060+00:00,31104294373,Economic response to the coronavirus,Government information - Australia,English,.au
935,2159641,https://nl.wikipedia.org/wiki/Uitbraak_SARS-Co...,alext,2020-03-12 14:51:58.562338+00:00,2020-03-20 16:46:09.778125+00:00,31104294373,Uitbraak SARS-CoV-2 in Nederland,Wikipedia,Dutch,.org
1268,2159623,https://coronavirus.nl/,alext,2020-03-12 14:51:57.133264+00:00,2020-03-20 16:46:06.026886+00:00,31104294373,Coronavirus.nl,Maps,Dutch,.nl
1596,2159644,https://projekte.sueddeutsche.de/artikel/wisse...,alext,2020-03-12 14:51:58.611342+00:00,2020-03-20 16:45:01.840025+00:00,31104294373,Coronavirus: Die Wucht der großen Zahl,News Media / statistical calculation,German,.de
681,2159646,https://studentaffairs.unt.edu/student-health-...,alext,2020-03-12 14:51:58.648543+00:00,2020-03-20 16:44:32.926990+00:00,31104294373,Coronavirus,University of North Texas,English,.edu
715,2159629,https://hms.harvard.edu/coronavirus,alext,2020-03-12 14:51:57.258569+00:00,2020-03-20 16:44:31.722017+00:00,31104294373,Coronavirus: Guidance for HMS Community around...,Harvard University,English,.edu
741,2159632,https://medical.mit.edu/news/2020/01/2019-nove...,alext,2020-03-12 14:51:57.333911+00:00,2020-03-20 16:44:30.545712+00:00,31104294373,COVID-19 (coronavirus disease 2019) updat,Massachusetts Institute of Technology,English,.edu
449,2159625,https://coronavirusupdates.uchicago.edu/,alext,2020-03-12 14:51:57.179460+00:00,2020-03-20 16:44:25.996071+00:00,31104294373,Coronavirus Updates,University of Chicago,English,.edu
303,2159633,https://news.dartmouth.edu/covid-19/covid-19-c...,alext,2020-03-12 14:51:57.373570+00:00,2020-03-20 16:44:22.997703+00:00,31104294373,COVID-19: Coronavirus Information,Dartmouth College,English,.edu


## 3. Languages

We can see that there are a large number of Portuguese seeds. I guess that's because someone involved in web archiving in Portugal or Brazil got busy.

In [15]:
altair.Chart(seeds).mark_bar().encode(
    altair.X('language', title='Language'),
    altair.Y('count(id)')
)
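
The same counts can be read off as numbers instead of bars. A minimal sketch, using a small made-up DataFrame in place of `seeds` (the rows here are illustrative only):

```python
import pandas

# Made-up stand-in for the seeds DataFrame; only the column names match.
sample = pandas.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "language": ["Portuguese", "English", "Portuguese", "French", "Portuguese"],
})

# Number of seeds per language, largest first
language_counts = sample["language"].value_counts()
print(language_counts)
```

Running `seeds['language'].value_counts()` on the real data should give the numbers behind the chart.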

## 4. Created

We can see that the vast majority of these seeds were entered into Archive-It on February 20, 2020, presumably from the spreadsheet sitting behind the Google Form.

In [16]:
altair.Chart(seeds).mark_bar().encode(
    altair.X('monthdate(created)', title='Created'),
    altair.Y('count(id)')
)

## 5. Last Update

Similarly we can look to see when the last update time was for each seed.

In [17]:
altair.Chart(seeds).mark_bar().encode(
    altair.X('monthdate(updated)', title='Updated'),
    altair.Y('count(id)')
)

It looks like most of the seeds were last updated a few days ago. But does this mean that was the last time they were crawled?

## 6. Get the Crawls

Oddly I couldn't seem to get any of the crawl related Partner API endpoints to work. Maybe I need to have created the crawls? At any rate, I can use the URL to look directly in the Wayback Machine to see what is available. The EDGI folks have created a nice [Wayback](https://wayback.readthedocs.io/en/latest/usage.html) module that lets you easily look up URLs in the Wayback Machine (it uses their CDX API behind the scenes).

This can take some time, so I'm going to save off the results in a `crawls.csv`. If you prefer to use the stored `crawls.csv` you can skip ahead to **Section 8**. This will collect crawl information for these URLs from 2019-10-01 on so we can look at their coverage before and after the project started.

In [31]:
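wb = wayback.WaybackClient()

# Open the file in a with block so the rows are flushed and the file is closed
with open('data/crawls.csv', 'w', newline='', encoding='utf-8') as crawls_file:
    out = csv.writer(crawls_file)
    out.writerow(['timestamp', 'url', 'status_code', 'archive_url'])

    for index, row in seeds.iterrows():
        try:
            for crawl in wb.search(row.url, from_date=datetime.datetime(2019, 10, 1)):
                out.writerow([
                    crawl.timestamp.isoformat(),
                    crawl.url,
                    crawl.status_code,
                    crawl.view_url
                ])
        except Exception as e:
            # Some lookups are refused (e.g. 403 Forbidden); report and continue
            print(e)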

403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fa2larm.cz%2F2020%2F02%2Fslavoj-zizek-melancholicka-krasa-virove-pandemie%2F&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fpoliticalcartoons.com%2F%3Fs%3Dcoronavirus&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.cagle.com%2Fdave-granlund%2F2020%2F01%2Fcoronavirus-usa&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fnationalpost.com%2Fhealth%2Fbio-warfare-experts-question-why-canada-was-sending-lethal-viruses-to-china&from=20191001000000&showResumeKey=true&resolveRevisits=true


It's interesting that some of the URLs are forbidden for viewing. I'm not sure what's going on there. One important thing to keep in mind is that these URLs could have been crawled by other users of Archive-It or by the Internet Archive's own crawlers.
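
If you want to look into the forbidden URLs later, a variation on the loop above can keep the failures instead of only printing them. This is just a sketch: `collect_crawls` and `fake_search` are hypothetical names, and `fake_search` stands in for `wb.search` so the example runs offline.

```python
def collect_crawls(urls, search):
    """Run search(url) for each URL, returning (records, failures)."""
    records, failures = [], []
    for url in urls:
        try:
            records.extend(search(url))
        except Exception as e:
            # Remember which URL failed and why, then carry on
            failures.append((url, str(e)))
    return records, failures

def fake_search(url):
    # Stand-in for wb.search: refuses one URL, returns one record otherwise
    if "blocked" in url:
        raise RuntimeError("403 Client Error: Forbidden")
    return [{"url": url, "status_code": 200}]

records, failures = collect_crawls(
    ["http://example.com/", "http://blocked.example.com/"], fake_search
)
print(len(records), failures)
```

The `failures` list could then be written out next to `crawls.csv` so the refused seeds can be retried or inspected by hand.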

## 8. View the Crawls

Now let's load in the `crawls.csv` as a DataFrame and look at the number of crawls over time. It's actually useful to save a sorted version of `crawls.csv` so that it can easily be diffed with previous versions.

In [32]:
crawls = pandas.read_csv('data/crawls.csv', parse_dates=['timestamp'])
crawls = crawls.sort_values('timestamp')
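# index=False keeps the row index out of the file, so re-reading it
# does not pick up an extra unnamed column
crawls.to_csv('data/crawls.csv', index=False)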
crawls

Unnamed: 0,timestamp,url,status_code,archive_url
21930,2019-10-01 01:24:55,https://www.dw.com/,,http://web.archive.org/web/20191001012455/http...
21929,2019-10-01 01:24:55,http://www.dw.com/,,http://web.archive.org/web/20191001012455/http...
25396,2019-10-01 03:14:00,https://cn.ambafrance.org/,200.0,http://web.archive.org/web/20191001031400/http...
21931,2019-10-01 03:21:29,http://dw.com/,301.0,http://web.archive.org/web/20191001032129/http...
14610,2019-10-01 05:03:44,https://www.canada.ca/en/services/immigration-...,200.0,http://web.archive.org/web/20191001050344/http...
...,...,...,...,...
21432,2020-03-21 12:40:32,https://www.saude.gov.br/saude-de-a-z/coronavirus,301.0,http://web.archive.org/web/20200321124032/http...
20969,2020-03-21 12:50:24,https://www.theguardian.com/world/coronavirus-...,200.0,http://web.archive.org/web/20200321125024/http...
4713,2020-03-21 13:06:08,https://www.bundesgesundheitsministerium.de/co...,200.0,http://web.archive.org/web/20200321130608/http...
4940,2020-03-21 13:11:34,https://www.elsevier.com/connect/coronavirus-i...,200.0,http://web.archive.org/web/20200321131134/http...


In [33]:
crawls_per_day = crawls.set_index('timestamp').resample('1D')['url'].count()
crawls_per_day = crawls_per_day.reset_index()
crawls_per_day.columns = ['date', 'crawls']
crawls_per_day

Unnamed: 0,date,crawls
0,2019-10-01,20
1,2019-10-02,48
2,2019-10-03,22
3,2019-10-04,51
4,2019-10-05,35
...,...,...
168,2020-03-17,824
169,2020-03-18,702
170,2020-03-19,673
171,2020-03-20,612
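
To see what the `resample('1D')` step is doing, here is a minimal sketch on a few made-up timestamps (not real crawl data). Note that a day with no crawls still shows up, with a count of 0:

```python
import pandas

# Made-up crawl timestamps: two on Oct 1, none on Oct 2, one on Oct 3
toy = pandas.DataFrame({
    "timestamp": pandas.to_datetime([
        "2019-10-01 01:00", "2019-10-01 23:30", "2019-10-03 12:00",
    ]),
    "url": ["http://a.example/", "http://b.example/", "http://a.example/"],
})

# Bucket rows into calendar days and count the URLs in each bucket
per_day = toy.set_index("timestamp").resample("1D")["url"].count()
print(per_day)
```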


In [34]:
altair.Chart(crawls_per_day, width=800).mark_bar().encode(
    altair.X('date', title='Crawl Date'),
    altair.Y('crawls', title='Crawls')
)

## 9. Missing Crawls

We can definitely see these URLs are being crawled a whole lot more since the start of the project. But the graph shows what has been crawled (irrespective of who did it). It also doesn't show which seed URLs have not been crawled yet.

To see what might be missing let's first group our crawl data by URL, and count how many crawls there have been for each URL.

In [67]:
crawls_by_url = crawls.groupby('url').count().timestamp
crawls_by_url.name = 'crawls'
crawls_by_url.head()

url
http://9news.com.au/coronavirus                                                                                               2
http://abcnews.go.com/Health/1300-people-died-flu-year/story?id=67754182                                                     83
http://abola.pt/nnh/2020-02-03/formula-1-coronavirus-ameaca-gp-da-china/827542                                                1
http://atarde.uol.com.br/bahia/noticias/2117774-navio-de-singapura-atracara-em-ilheus-pais-registrou-casos-de-coronavirus     4
http://atarde.uol.com.br/brasil/noticias/2117622-coronavirus-voos-sao-vistoriados-pela-anvisa-em-sp                           4
Name: crawls, dtype: int64

Next we can take our `seeds` DataFrame and index it by URL, so that we can add our `crawls_by_url` series to it, since it is also indexed by `url`. Kinda nice how pandas makes this join easy. The use of `fillna` there is to convert any null values (where there have been no crawls yet) to 0.

In [148]:
seeds_by_url = seeds.set_index('url')
seeds_by_url['crawls'] = crawls_by_url
seeds_by_url.crawls = seeds_by_url.crawls.fillna(0)
seeds_by_url.head()

url,id,creator,created,updated,crawl_definition,title,description,language,tld,crawls
http://coronavirus.fr/,2147692,alext,2020-02-21 03:43:18.662353+00:00,2020-03-16 19:53:45.860949+00:00,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr,0.0
http://english.whiov.cas.cn/,2147693,alext,2020-02-21 03:43:18.706571+00:00,2020-03-16 19:52:28.575749+00:00,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn,74.0
http://www.china-embassy.or.jp/chn/,2147694,alext,2020-02-21 03:43:18.739126+00:00,2020-03-16 19:53:03.086729+00:00,31104294373,中华人民共和国驻日本大使馆,Embassy,Chinese,.jp,264.0
http://www.china-embassy.or.jp/jpn/,2147695,alext,2020-02-21 03:43:18.766308+00:00,2020-03-16 19:54:02.280945+00:00,31104294373,中華人民共和国駐日本国大使館,Embassy,Japanese,.jp,18.0
https://cadenaser.com/tag/ncov/a/,2147696,alext,2020-02-21 03:43:18.791716+00:00,2020-03-16 19:54:19.694418+00:00,31104294373,Coronavirus de Wuhan,Cadena Ser,Spanish,.com,53.0
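
The alignment that makes this join work can be seen on a toy example. The sketch below uses made-up URLs and counts: assigning a Series to a DataFrame column matches rows by index label, and labels missing from the Series come out as NaN before `fillna` turns them into 0.

```python
import pandas

# Made-up seeds indexed by URL
toy_seeds = pandas.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pandas.Index(
        ["http://a.example/", "http://b.example/", "http://c.example/"],
        name="url",
    ),
)

# Made-up crawl counts; note there is no entry for b.example
toy_counts = pandas.Series(
    {"http://c.example/": 5, "http://a.example/": 2}, name="crawls"
)

# Assignment aligns on the index labels, not on position
toy_seeds["crawls"] = toy_counts
toy_seeds["crawls"] = toy_seeds["crawls"].fillna(0)
print(toy_seeds)
```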


So now we can see which seeds still need to be crawled, or perhaps still need to have their crawls made public.

In [297]:
missing = seeds_by_url[seeds_by_url.crawls == 0.0]
print("{0} URLS are missing crawls, which is {1:.2f}% of the total seeds.".format(
    len(missing),
    len(missing) / len(seeds_by_url) * 100
))

940 URLS are missing crawls, which is 46.24% of the total seeds.
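
One caveat that may be worth checking: the join above matches URLs exactly, while the crawl records can carry variants of the same address (the `crawls.csv` listing in Section 8 includes `http://dw.com/`, `http://www.dw.com/` and `https://www.dw.com/`). A seed that was only captured under a variant would be counted as missing. Here is a rough sketch of one way to compare URLs more loosely. `normalize_url` is a hypothetical helper, and the rules it applies (ignore the scheme, a leading `www.`, and a trailing slash) are an assumption for illustration, not how the Wayback Machine itself canonicalizes URLs.

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Reduce a URL to host + path + query, ignoring scheme, 'www.' and a trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query

variants = ["http://dw.com/", "http://www.dw.com/", "https://www.dw.com/"]
keys = {normalize_url(u) for u in variants}
print(keys)
```

Grouping both `seeds` and `crawls` by such a key before joining might lower the missing count, though how much is something to test rather than assume.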
