# COVID-19 Seeds

This notebook explores the seeds that are being crawled in the [Novel Coronavirus COVID-19](https://archive-it.org/collections/13529/) Archive-It collection. It uses the [Archive-It Partner API](https://support.archive-it.org/hc/en-us/articles/360032747311-Access-your-account-with-the-Archive-It-Partner-API), which does not seem to require a key for public collections (yay).

## Get the Seeds

First let's download the seeds in the collection and save them as a CSV file called `seeds.csv`. If you want to use the CSV that's already here, you can skip this section. You can see the type of data that is returned by looking at [this API response](https://partner.archive-it.org/api/seed?collection=13529&limit=100). The Archive-It Partner API has a route that returns the seeds for the collection named in the `collection` parameter. We can use the `limit` and `offset` parameters to walk through the results page by page instead of fetching all of them at once.

In [16]:
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "limit": 100
}
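
To make the paging concrete, here is a minimal sketch of the request URLs that `limit` and `offset` produce for the first few pages. The `page_url` helper and the `BASE_URL` name are made up for illustration and are not part of the API; the sketch only builds the URLs and does not contact the server.

```python
from urllib.parse import urlencode

BASE_URL = "https://partner.archive-it.org/api/seed"

def page_url(collection, limit, offset):
    """Build the request URL for one page of seeds (hypothetical helper)."""
    query = urlencode({"collection": collection, "limit": limit, "offset": offset})
    return f"{BASE_URL}?{query}"

# The first three pages of the collection, 100 seeds per page.
for offset in (0, 100, 200):
    print(page_url(13529, 100, offset))
```

Each page starts where the previous one ended, so the offset grows by the value of `limit` on every request.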

Now we can create a loop that keeps fetching pages and incrementing the offset until no more seeds come back. We could have used the CSV output, but building the rows ourselves lets us normalize some of the structured metadata. This will likely take a few minutes to run.

In [26]:
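import csv
import requests

def first_val(meta, name):
    # first value recorded for a metadata field, or None if it is missing or empty
    values = meta.get(name)
    return values[0]["value"] if values else None

params['offset'] = 0

# write to seeds.csv, the file that is loaded in the next section;
# newline='' keeps the csv module from adding blank rows on Windows
with open('seeds.csv', 'w', newline='', encoding='utf-8') as fh:
    out = csv.writer(fh)
    out.writerow([
        "id",
        "url",
        "creator",
        "created",
        "updated",
        "crawl_definition",
        "title",
        "description",
        "language",
        "tld"
    ])

    while True:
        resp = requests.get(url, params=params)
        resp.raise_for_status()  # stop on an HTTP error instead of parsing an error page
        seeds = resp.json()
        if len(seeds) == 0:
            break

        for seed in seeds:
            meta = seed["metadata"]
            out.writerow([
                seed["id"],
                seed["url"],
                seed["created_by"],
                seed["created_date"],
                seed["last_updated_date"],
                seed["crawl_definition"],
                first_val(meta, "Title"),
                first_val(meta, "Description"),
                first_val(meta, "Language"),
                first_val(meta, "Top-Level Domain")
            ])

        # move on to the next page
        params['offset'] += params['limit']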
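
To see what the metadata normalization does, here is a small sketch that runs a `first_val` helper over a made-up seed record. The record is hypothetical: its shape (each metadata field holding a list of `{"value": ...}` entries) is inferred from how the loop above reads the response, not taken from the API documentation. This variant of the helper also returns `None` for an empty list, which is an added assumption.

```python
def first_val(meta, name):
    """Return the first value recorded for a metadata field, or None."""
    values = meta.get(name)
    return values[0]["value"] if values else None

# Hypothetical seed record, shaped the way the loop reads it.
sample_seed = {
    "id": 2147692,
    "url": "http://coronavirus.fr/",
    "metadata": {
        "Title": [{"value": "Epicorem. Ecoépidémiologie"}],
        "Language": [{"value": "French"}],
    },
}

meta = sample_seed["metadata"]
print(first_val(meta, "Title"))        # field present
print(first_val(meta, "Description"))  # field absent
```

Fields that a seed does not carry come back as `None`, which the `csv` module writes as an empty cell.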

So now you should hopefully see an updated `seeds.csv`!

## Display the Seeds

Now let's load our `seeds.csv` into a Pandas DataFrame, where we can more easily manipulate it.

In [24]:
import pandas

seeds = pandas.read_csv('seeds.csv', parse_dates=["created", "updated"])
seeds.head()

| | id | url | creator | created | updated | crawl_definition | title | description | language | tld |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21 03:43:18.662353 | 2020-03-16 19:53:45.860949 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
| 1 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21 03:43:18.706571 | 2020-03-16 19:52:28.575749 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
| 2 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21 03:43:18.739126 | 2020-03-16 19:53:03.086729 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
| 3 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21 03:43:18.766308 | 2020-03-16 19:54:02.280945 | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
| 4 | 2147696 | https://cadenaser.com/tag/ncov/a/ | alext | 2020-02-21 03:43:18.791716 | 2020-03-16 19:54:19.694418 | 31104294373 | Coronavirus de Wuhan | Cadena Ser | Spanish | .com |
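
As one example of the kind of manipulation a DataFrame makes easy, here is a minimal sketch that counts seeds per top-level domain. It rebuilds three columns of the five rows shown above inline, so it runs without `seeds.csv`; the `sample` and `tld_counts` names are only for illustration.

```python
import pandas

# Three columns from the five rows shown above, rebuilt inline.
sample = pandas.DataFrame({
    "url": [
        "http://coronavirus.fr/",
        "http://english.whiov.cas.cn/",
        "http://www.china-embassy.or.jp/chn/",
        "http://www.china-embassy.or.jp/jpn/",
        "https://cadenaser.com/tag/ncov/a/",
    ],
    "language": ["French", "English", "Chinese", "Japanese", "Spanish"],
    "tld": [".fr", ".cn", ".jp", ".jp", ".com"],
})

# Number of seeds per top-level domain, most common first.
tld_counts = sample["tld"].value_counts()
print(tld_counts)
```

The same call on the full `seeds` DataFrame, `seeds["tld"].value_counts()`, would summarize the whole collection.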
