# PURLs

The Internet Archive [took over the maintenance](https://blog.archive.org/2016/09/27/persistent-url-service-purl-org-now-run-by-the-internet-archive/) of the purl.org service from OCLC in September 2016. I recently learned that the purl.org service is some caching plus a Python program that fetches data stored in the Internet Archive's storage. It turns out each purl.org namespace, like *https://purl.org/dc/*, is an item in the [purl_collection](https://archive.org/details/purl_collection). Each item has a JSON document that contains the URL patterns used by the namespace.

For example, here's the one for Dublin Core:



In [15]:
import json
import requests

dc = requests.get('https://archive.org/download/purl_dc/purl_dc_purl.json').json()
print(json.dumps(dc, indent=2))

{
  "name": "/dc",
  "created": "2009-07-22 17:44:37",
  "maintainer": {
    "fullname": "Dublin Core Metadata Initiative",
    "userid": "purl@dublincore.net",
    "email": "purl@dublincore.net",
    "affiliation": "DCMI"
  },
  "purls": [
    {
      "type": "302",
      "target": "http://purl.org/DC/",
      "name": "/DC",
      "created": "2009-07-22 19:36:26",
      "modified": "2012-11-16 14:25:18"
    },
    {
      "type": "partial",
      "target": "http://dublincore.org/",
      "name": "/DC/",
      "created": "2009-07-23 01:48:18",
      "modified": "2012-11-16 14:25:37"
    },
    {
      "type": "302",
      "target": "http://dublincore.org/DCMI.rdf",
      "name": "/dc/aboutdcmi",
      "created": "2010-02-28 08:51:06",
      "modified": "2010-02-28 08:51:06"
    },
    {
      "type": "partial",
      "target": "/dc/test/terms/",
      "name": "/dc/test/terms",
      "created": "2012-05-28 16:59:52",
      "modified": "2012-06-09 12:26:35"
    },
    {
      "type": "pa

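To make the shape of a namespace document more concrete, here is a minimal sketch of how a request path might be matched against the `purls` list. It is not the purl.org implementation. The `resolve` function and `dc_sample` are hypothetical names, the sample entries are copied from the output above, and the matching rules are assumptions: exact names win, a `partial` entry is treated as a prefix whose unmatched remainder is appended to the target, and comparison is case-sensitive (the real service may well differ on each point).

```python
def resolve(path, namespace):
    """Return (type, target) for a request path, or None if nothing matches.

    Hypothetical helper: the matching rules are assumptions, not the
    documented behaviour of purl.org.
    """
    purls = namespace["purls"]

    # Assumption: an exact name match takes priority over partial matches.
    for purl in purls:
        if purl["type"] != "partial" and purl["name"] == path:
            return purl["type"], purl["target"]

    # Assumption: a "partial" entry matches any path that starts with its
    # name, and the longest matching name wins.
    partials = [
        p for p in purls
        if p["type"] == "partial" and path.startswith(p["name"])
    ]
    if not partials:
        return None
    best = max(partials, key=lambda p: len(p["name"]))

    # Append whatever follows the matched prefix to the target.
    remainder = path[len(best["name"]):]
    return "partial", best["target"] + remainder


# A few entries copied from the Dublin Core namespace document.
dc_sample = {
    "name": "/dc",
    "purls": [
        {"type": "302", "target": "http://purl.org/DC/", "name": "/DC"},
        {"type": "partial", "target": "http://dublincore.org/", "name": "/DC/"},
        {"type": "302", "target": "http://dublincore.org/DCMI.rdf", "name": "/dc/aboutdcmi"},
    ],
}

print(resolve("/dc/aboutdcmi", dc_sample))   # ('302', 'http://dublincore.org/DCMI.rdf')
print(resolve("/DC/documents/", dc_sample))  # ('partial', 'http://dublincore.org/documents/')
```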
There is also a history of the previous versions:

In [19]:
hist = requests.get('https://archive.org/download/purl_dc/purl_dc_purl_history.json').json()
print(json.dumps(hist, indent=2))

{
  "history": [
    {
      "name": "/dc/topicmap",
      "type": "302",
      "target": "http://orc.dev.oclc.org:9016/iiop/TopicMap?browse=dublin_core",
      "status": 0,
      "modtime": "2009-07-22 19:11:39",
      "user": "admin"
    },
    {
      "name": "/DC",
      "type": "302",
      "target": "http://purl.org/DC/",
      "status": 0,
      "modtime": "2009-07-22 19:36:26",
      "user": "admin"
    },
    {
      "name": "/DC",
      "type": "302",
      "target": "http://purl.org/DC/",
      "status": 1,
      "modtime": "2012-11-16 14:25:18",
      "user": "admin"
    },
    {
      "name": "/DC/",
      "type": "partial",
      "target": "http://dublincore.org/",
      "status": 0,
      "modtime": "2009-07-23 01:48:18",
      "user": "admin"
    },
    {
      "name": "/DC/",
      "type": "partial",
      "target": "http://dublincore.org/",
      "status": 1,
      "modtime": "2012-11-16 14:25:37",
      "user": "admin"
    },
    {
      "name": "/dc/science",
      
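Because the history is a flat list, it can be easier to read once the entries are grouped by PURL name. The following is a small sketch of that idea. `group_history` and `hist_sample` are hypothetical names, the sample entries are copied from the output above, and the meaning of the `status` field is not interpreted here since the source does not explain it.

```python
from collections import defaultdict


def group_history(history_doc):
    """Group history entries by PURL name, oldest modification first."""
    by_name = defaultdict(list)
    for entry in history_doc["history"]:
        by_name[entry["name"]].append(entry)

    # "YYYY-MM-DD HH:MM:SS" strings sort chronologically as plain text.
    for entries in by_name.values():
        entries.sort(key=lambda e: e["modtime"])
    return dict(by_name)


# Entries copied from the Dublin Core history document.
hist_sample = {
    "history": [
        {"name": "/DC", "type": "302", "target": "http://purl.org/DC/",
         "status": 1, "modtime": "2012-11-16 14:25:18", "user": "admin"},
        {"name": "/dc/topicmap", "type": "302",
         "target": "http://orc.dev.oclc.org:9016/iiop/TopicMap?browse=dublin_core",
         "status": 0, "modtime": "2009-07-22 19:11:39", "user": "admin"},
        {"name": "/DC", "type": "302", "target": "http://purl.org/DC/",
         "status": 0, "modtime": "2009-07-22 19:36:26", "user": "admin"},
    ]
}

grouped = group_history(hist_sample)
for name, entries in grouped.items():
    print(name, [e["modtime"] for e in entries])
```

With the real document, `group_history(hist)` would give one list per PURL, which makes it straightforward to see how often a given redirect was touched.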

## Download PURLs

This means we can search for all of the namespaces and download their JSON namespace data using the [internetarchive](https://pypi.org/project/internetarchive/) Python library, which talks to the Internet Archive storage API.

In [None]:
import pathlib
import time
from internetarchive import search_items, download

data_dir = pathlib.Path('data/purl')

for match in search_items('collection:purl_collection'):
    item_id = match['identifier']
    
    # if it hasn't been downloaded yet
    if not (data_dir / item_id).is_dir():
        download(item_id, destdir=data_dir, glob_pattern='*.json')
        time.sleep(.5)
    else:
        print(f'already downloaded {item_id}')

Running this *for a few days* resulted in 21,680 namespace items being downloaded. Another 26 Internet Archive items were not downloaded because they lacked the JSON files needed to understand them as PURL namespaces. Here is a list of those in case you are interested:

* purl_NetLink
* purl_RadioNegashi.net
* purl_SeanPetiya
* purl_TechOpsTraining
* purl_candycesessayprompt
* purl_cyber_cbo
* purl_gramaticie
* purl_gzirtadiss
* purl_iip-srm.owl
* purl_iot_vocab_m3
* purl_ist_interoperability-key-point
* purl_ist_interoperable-quality-metric
* purl_jorge_gramie
* purl_jrgjuanf_gramie
* purl_linkeddatatest
* purl_map4scrutiny_core
* purl_net_NetLink
* purl_net_overload
* purl_oblp
* purl_pradeo_cbo
* purl_schema_cocoon
* purl_swarms
* purl_unibomp_schema
* purl_xapi_ontology_
* purl_xinli
* purl_yoolib
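A list like this one could be produced by checking, for each item identifier, whether the expected namespace file exists on disk. The sketch below is one way to do that, not necessarily how the list above was made. `items_missing_namespace_json` is a hypothetical helper, and the `<item_id>/<item_id>_purl.json` layout is an assumption inferred from the `purl_dc/purl_dc_purl.json` download URL.

```python
import pathlib


def items_missing_namespace_json(item_ids, data_dir):
    """Return the identifiers that have no <item_id>_purl.json under data_dir.

    Assumes each item was downloaded into its own directory named after
    the identifier, as in data/purl/purl_dc/purl_dc_purl.json.
    """
    data_dir = pathlib.Path(data_dir)
    missing = []
    for item_id in item_ids:
        namespace_file = data_dir / item_id / f"{item_id}_purl.json"
        if not namespace_file.is_file():
            missing.append(item_id)
    return missing
```

It could be called with the identifiers returned by the search, for example `items_missing_namespace_json([m['identifier'] for m in search_items('collection:purl_collection')], 'data/purl')`.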
 