# OAI-PMH
> Open Archives Initiative Protocol for Metadata Harvesting 

The protocol is defined in the [OAI-PMH specification](http://www.openarchives.org/OAI/openarchivesprotocol.html). Its use at the Royal Danish Library is described in the library's [access-digital-objects guide](https://github.com/Det-Kongelige-Bibliotek/access-digital-objects).

## Identify OAI provider

Download data and show raw output:

In [1]:
!curl -sL "http://www.kb.dk/cop/oai/?verb=Identify" > /tmp/Identify.tmp
!cat /tmp/Identify.tmp

<?xml version="1.0" encoding="UTF-8" ?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2020-04-02T08:35:27Z</responseDate><request verb="Identify">http://www5.kb.dk/cop/oai/</request><Identify><repositoryName>COP2 Repository</repositoryName><baseURL>http://www5.kb.dk/cop/oai/</baseURL><protocolVersion>2.0</protocolVersion><adminEmail>webmaster@kb.dk</adminEmail><earliestDatestamp>2000-01-01</earliestDatestamp><deletedRecord>no</deletedRecord><granularity>YYYY-MM-DD</granularity><compression>gzip</compression><compression>deflate</compression><description><oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd"><sc

We can use `xq` to make this easier to read:

In [2]:
!cat /tmp/Identify.tmp | xq '.' | head -n 20

{
  "OAI-PMH": {
    "@xmlns": "http://www.openarchives.org/OAI/2.0/",
    "@xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "@xsi:schemaLocation": "http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd",
    "responseDate": "2020-04-02T08:35:27Z",
    "request": {
      "@verb": "Identify",
      "#text": "http://www5.kb.dk/cop/oai/"
    },
    "Identify": {
      "repositoryName": "COP2 Repository",
      "baseURL": "http://www5.kb.dk/cop/oai/",
      "protocolVersion": "2.0",
      "adminEmail": "webmaster@kb.dk",
      "earliestDatestamp": "2000-01-01",
      "deletedRecord": "no",
      "granularity": "YYYY-MM-DD",
      "compression": [
        "gzip",
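
The same fields can also be read directly in Python, without `xq`. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`. The embedded XML is an abbreviated copy of the response shown above (the XML declaration and the `description` block are left out), and `parse_identify` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated copy of the Identify response shown above (assumption: the
# <description> block is not needed for this illustration).
IDENTIFY_XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2020-04-02T08:35:27Z</responseDate>
  <request verb="Identify">http://www5.kb.dk/cop/oai/</request>
  <Identify>
    <repositoryName>COP2 Repository</repositoryName>
    <baseURL>http://www5.kb.dk/cop/oai/</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>webmaster@kb.dk</adminEmail>
    <earliestDatestamp>2000-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
    <compression>gzip</compression>
    <compression>deflate</compression>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return the children of <Identify> as a dict mapping tag name to a list of values."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    fields = {}
    for child in identify:
        # ElementTree stores tags as "{namespace}name"; keep only the name.
        name = child.tag.split("}", 1)[-1]
        fields.setdefault(name, []).append((child.text or "").strip())
    return fields


info = parse_identify(IDENTIFY_XML)
print(info["repositoryName"][0])
print(info["compression"])
```

Values are kept as lists because some elements, such as `compression`, may occur more than once.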


## List sets

Download data and show raw output:

In [3]:
!curl -sL "http://oai.kb.dk/oai/provider?verb=ListSets" > /tmp/ListSets.tmp
!head -n 20 /tmp/ListSets.tmp

<?xml version="1.0" encoding="UTF-8" ?>



<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
 http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2020-04-02T08:35:27Z</responseDate>
<request verb="ListSets">http://oai.kb.dk/oai/provider</request>

<ListSets>
	
		<set> 
			
			<!-- This set contains 913 records -->
			
			<setSpec>kb:boeger:ww1</setSpec>	
			<setName>Aleph Books WW1</setName>
			<setDescription>


Use `xq` to filter the set-names, and display the first few:

In [4]:
set_names = !cat /tmp/ListSets.tmp | xq -r '."OAI-PMH" | .ListSets | .set[] | .setName'
print('First 5 set names:', set_names[:5])
print('Total set names: ', len(set_names))

First 5 set names: ['Aleph Books WW1', 'Daells Varehus', 'David Simonsens Arkiv', 'David Simonsens Haandskrifter', 'Gieddes samling']
Total set names:  39


In [5]:
set_specs = !cat /tmp/ListSets.tmp | xq -r '."OAI-PMH" | .ListSets | .set[] | .setSpec'
print('First 5 set specs:', set_specs[:5])

First 5 set specs: ['kb:boeger:ww1', 'kb.daellsvarehus', 'kb.dsa', 'kb.dsh', 'kb.gie']
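
The two lists above are extracted separately, so each name has to be matched to its spec by position. As an alternative, here is a minimal sketch that reads each `<set>` element once and pairs its `setSpec` with its `setName`. The sample XML is a hand-abbreviated stand-in built from the first two sets in the output above (assumption: they appear in the same order in both lists), and `sets_by_spec` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated stand-in for /tmp/ListSets.tmp (assumption: real entries also
# carry a <setDescription>, which is omitted here).
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <!-- This set contains 913 records -->
      <setSpec>kb:boeger:ww1</setSpec>
      <setName>Aleph Books WW1</setName>
    </set>
    <set>
      <setSpec>kb.daellsvarehus</setSpec>
      <setName>Daells Varehus</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def sets_by_spec(xml_text):
    """Map each setSpec to its setName, reading both from the same <set> element."""
    root = ET.fromstring(xml_text)
    result = {}
    for set_el in root.iterfind("oai:ListSets/oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS).strip()
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS).strip()
        result[spec] = name
    return result


sets = sets_by_spec(SAMPLE_LIST_SETS)
for spec, name in sets.items():
    print(f"{spec}: {name}")
```

With the downloaded file, the same function could be called as `sets_by_spec(open("/tmp/ListSets.tmp", "rb").read())`, assuming the file parses as well-formed XML.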


## List records

Note that the set-specs returned above appear to be incomplete. Working versions of some set-specs are listed [here](https://github.com/Det-Kongelige-Bibliotek/access-digital-objects/blob/master/oai-pmh.md). For example, we can copy the set-spec for the Billeder edition:

In [6]:
set_name = "oai:kb.dk:images:billed:2010:okt:billeder"

We can now pull metadata on the first 1000 records in that edition:

In [7]:
set_url = f"http://www.kb.dk/cop/oai/?verb=ListRecords&set={set_name}&metadataPrefix=mods"
!curl -sL "$set_url" > /tmp/billeder.tmp
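
A single `ListRecords` response may cover only part of a set. In OAI-PMH, a provider can return an incomplete list together with a `resumptionToken`, and the client asks for the next page by sending only the verb and that token. The sketch below shows that loop under stated assumptions: `harvest`, `fake_fetch`, and the two canned pages are hypothetical and exist only so the example runs offline. Whether and how this particular server issues tokens is not verified here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Clark-notation prefix for the OAI-PMH 2.0 namespace.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(base_url, set_spec, metadata_prefix, fetch):
    """Yield <record> elements page by page, following resumption tokens.

    `fetch` is any callable that takes a URL and returns the response body
    (str or bytes); passing it in keeps the paging logic testable offline.
    """
    params = {"verb": "ListRecords", "set": set_spec, "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        yield from root.iter(f"{OAI}record")
        token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            return
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Two canned pages standing in for real responses (hypothetical data).
PAGE_1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier>demo:1</identifier></header></record>
<resumptionToken>page2</resumptionToken>
</ListRecords></OAI-PMH>"""

PAGE_2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier>demo:2</identifier></header></record>
<resumptionToken/>
</ListRecords></OAI-PMH>"""


def fake_fetch(url):
    """Stand-in for an HTTP call: serve page 2 only when its token is requested."""
    return PAGE_2 if "resumptionToken=page2" in url else PAGE_1


records = harvest("http://example.org/oai", "demo:set", "mods", fake_fetch)
ids = [r.findtext(f"{OAI}header/{OAI}identifier") for r in records]
print(ids)
```

For a real harvest, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`, ideally with a pause between requests.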

Inspect the last entry, taken here as the final 204 lines of the pretty-printed output:

In [8]:
!cat /tmp/billeder.tmp | xq '.' | tail -n 204 > /tmp/entry.tmp
!head -n 20 /tmp/entry.tmp

          "header": {
            "identifier": "oai:kb.dk:images:billed:2010:okt:billeder:object367448",
            "datestamp": "2018-08-29T16:10:37Z",
            "setSpec": "oai:kb.dk:images:billed:2010:okt:billeder"
          },
          "metadata": {
            "md:mods": {
              "@xmlns:md": "http://www.loc.gov/mods/v3",
              "@xmlns:xlink": "http://www.w3.org/1999/xlink",
              "@xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
              "@xsi:schemaLocation": "http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd",
              "md:identifier": [
                {
                  "@type": "uri",
                  "#text": "http://www.kb.dk/images/billed/2010/okt/billeder/object367448/da/"
                },
                {
                  "@xmlns:java": "http://xml.apache.org/xalan/java",
                  "@xmlns:mix": "http://www.loc.gov/mix/v10",
                  "@xmlns:t": "http://ww

It is unclear, however, whether page counts are actually provided in this metadata.

## Pull image

We can extract the image URL of the last entry like this:

In [9]:
raw_image_url_line = !cat /tmp/entry.tmp | grep .jpg | head -n 1
image_url = raw_image_url_line[0].split('"')[-2]
print(image_url)

http://kb-images.kb.dk/DAMJP2/online_master_arkiv_3/non-archival/Images/BLADTE/erik_thorsen/erik_thorsen_kasse1/db_erik_thorsen_00027/full/full/0/native.jpg
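
Grepping the pretty-printed JSON depends on line layout and quoting. A somewhat more robust approach, sketched here, is to walk the XML and collect every text or attribute value that ends in `.jpg`. The sample record and `find_image_urls` are hypothetical: the real MODS records may place the image URL in a different element than the one used for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record fragment; element placement is an assumption and the
# URLs are placeholders, not real library addresses.
SAMPLE_RECORD = """<record xmlns:md="http://www.loc.gov/mods/v3">
  <md:mods>
    <md:identifier type="uri">http://example.org/object1/da/</md:identifier>
    <md:identifier displayLabel="image">http://example.org/images/demo/full/full/0/native.jpg</md:identifier>
  </md:mods>
</record>"""


def find_image_urls(xml_text, suffix=".jpg"):
    """Return all element texts and attribute values that end with `suffix`."""
    root = ET.fromstring(xml_text)
    urls = []
    for el in root.iter():
        # Check the element's own text as well as each of its attribute values.
        for value in [el.text or ""] + list(el.attrib.values()):
            value = value.strip()
            if value.lower().endswith(suffix):
                urls.append(value)
    return urls


image_urls = find_image_urls(SAMPLE_RECORD)
print(image_urls[0])
```

Because the function returns every match in document order, taking the first element mirrors the `head -n 1` step above.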


... and display it like this:

In [10]:
from IPython.display import Image
# Render the remote image inline; the URL is passed through as-is.
Image(url=image_url)