<h1><center> <font size="36"> How to harvest metadata from Zenodo </font> </center></h1>

#### Notebook outline 
 - Zenodo OAI-PMH protocol
 - Zenodo REST API
     - Exploration of the REST API response (payload) with the `requests` library
     - Using `ZenodoCI` lib
     - Using `PyZenodo3` lib

## OAI-PMH protocol

####  - A nice [tutorial on the protocol](https://indico.cern.ch/event/5710/sessions/108048/attachments/988151/1405129/Simeon_tutorial.pdf).

The [OAI-PMH protocol](https://www.openarchives.org/pmh/) combines a base URL with a request type (a 'verb') and its arguments to query the metadata representation(s) exposed by a data provider.

In the case of Zenodo, the base URL is: https://zenodo.org/oai2d

For example:
 - to retrieve all the entries (`verb=ListRecords`)
 - belonging to the escape2020 community (`set=user-escape2020`)
 - in the OAI DataCite metadata representation (`metadataPrefix=oai_datacite`)
 
https://zenodo.org/oai2d?verb=ListRecords&set=user-escape2020&metadataPrefix=oai_datacite


A second example:
 - to obtain a single entry (`verb=GetRecord`)
 - for a specific Zenodo record, identified by its entry id (`identifier=oai:zenodo.org:4105896`)
 - in the Dublin Core metadata representation (`metadataPrefix=oai_dc`)
 
https://zenodo.org/oai2d?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:zenodo.org:4105896
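
Both request URLs above follow the same pattern: the base URL followed by a verb and its arguments. A minimal sketch of how they could be assembled with the standard library only; the helper name `build_oai_url` is ours and is not part of any Zenodo client.

```python
from urllib.parse import urlencode

OAI_BASE_URL = "https://zenodo.org/oai2d"  # base URL quoted above


def build_oai_url(verb, **arguments):
    """Assemble an OAI-PMH request URL from a verb and its arguments."""
    query = {"verb": verb, **arguments}
    # Keep ':' unescaped so identifiers read exactly as in the examples above.
    return f"{OAI_BASE_URL}?{urlencode(query, safe=':')}"


list_url = build_oai_url("ListRecords", set="user-escape2020",
                         metadataPrefix="oai_datacite")
record_url = build_oai_url("GetRecord", metadataPrefix="oai_dc",
                           identifier="oai:zenodo.org:4105896")
print(list_url)
print(record_url)
```

The two printed URLs are the same as the ones written by hand above.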

## Example with the OAI-PMH protocol: A Python OAI-Harvester

```
pip install oaiharvest
oai-harvest -h

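# Examples of usage
# Note: -d sets the output directory; the metadata format is selected with -p
# (default: oai_dc), which is why the files listed below end in .oai_dc.xml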
oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d oai_dc
oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d oai_datacite4
oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d datacite3

# Example of output
$ oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d datacite3
$ cd datacite3
$ ls
oai:zenodo.org:1689986.oai_dc.xml oai:zenodo.org:3884963.oai_dc.xml
oai:zenodo.org:2533132.oai_dc.xml oai:zenodo.org:3967386.oai_dc.xml
oai:zenodo.org:2542652.oai_dc.xml oai:zenodo.org:4012169.oai_dc.xml
oai:zenodo.org:2542664.oai_dc.xml oai:zenodo.org:4028908.oai_dc.xml
oai:zenodo.org:3356656.oai_dc.xml oai:zenodo.org:4044010.oai_dc.xml
oai:zenodo.org:3362435.oai_dc.xml oai:zenodo.org:4055176.oai_dc.xml
oai:zenodo.org:3572655.oai_dc.xml oai:zenodo.org:4105896.oai_dc.xml
oai:zenodo.org:3614662.oai_dc.xml oai:zenodo.org:4311271.oai_dc.xml
oai:zenodo.org:3659184.oai_dc.xml oai:zenodo.org:4419866.oai_dc.xml
oai:zenodo.org:3675081.oai_dc.xml oai:zenodo.org:4601451.oai_dc.xml
oai:zenodo.org:3734091.oai_dc.xml oai:zenodo.org:4687123.oai_dc.xml
oai:zenodo.org:3743489.oai_dc.xml oai:zenodo.org:4786641.oai_dc.xml
oai:zenodo.org:3743490.oai_dc.xml oai:zenodo.org:4790629.oai_dc.xml
oai:zenodo.org:3854976.oai_dc.xml
$ cat <FILE>
```
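
Instead of `cat`, the harvested files can be read from Python. The sketch below assumes that each `*.oai_dc.xml` file holds a Dublin Core (`oai_dc:dc`) element; the `SAMPLE` string is a hand-written stand-in for such a file, not real harvester output, and `dublin_core_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace

# Hand-written stand-in for the content of one harvested file (assumption).
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>ZenodoCI</dc:title>
  <dc:subject>ESCAPE</dc:subject>
</oai_dc:dc>"""


def dublin_core_fields(xml_text, field):
    """Return the text of every Dublin Core element named `field`."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{DC}{field}")]


print(dublin_core_fields(SAMPLE, "title"))
print(dublin_core_fields(SAMPLE, "subject"))
```

For a file on disk, `xml_text` would be the result of reading that file, for example `open(path).read()`.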

# Query Zenodo's records through its REST API

In [1]:
# pip install requests

In [2]:
import requests

In [24]:
token = ''

We need to specify some parameters to narrow the search.

In [4]:
parameters = {'access_token': token,
              'communities': 'escape2020',
              'size':100}

## Example with the `requests` lib - How to recover all ESCAPE2020 community records?

In [5]:
escape2020 = requests.get('https://zenodo.org/api/records', params=parameters).json()
escape2020.keys()

dict_keys(['aggregations', 'hits', 'links'])

Let's explore the REST API payload to find the desired information.

In [6]:
# Nice summary of the request we just made
escape2020['aggregations']

{'access_right': {'buckets': [{'doc_count': 13, 'key': 'open'}],
  'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 0},
 'file_type': {'buckets': [{'doc_count': 6, 'key': 'zip'},
   {'doc_count': 3, 'key': 'gz'},
   {'doc_count': 3, 'key': 'pdf'},
   {'doc_count': 2, 'key': 'json'},
   {'doc_count': 1, 'key': ''},
   {'doc_count': 1, 'key': 'md'},
   {'doc_count': 1, 'key': 'simg'}],
  'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 0},
 'keywords': {'buckets': [{'doc_count': 3, 'key': 'ESCAPE'},
   {'doc_count': 2, 'key': 'CTA'},
   {'doc_count': 1,
    'key': 'European Open Science Cloud, ESFRI, e-Infrastructures'},
   {'doc_count': 1,
    'key': 'Machine Learning, Big Data, Aapche Kafka, Gravitational Wave'},
   {'doc_count': 1, 'key': 'analysis'},
   {'doc_count': 1, 'key': 'c-plus-plus'},
   {'doc_count': 1, 'key': 'cmake'},
   {'doc_count': 1, 'key': 'geant4'},
   {'doc_count': 1, 'key': 'modular'},
   {'doc_count': 1, 'key': 'reconstruction'}],
  'doc_count_

In [7]:
# Total number of entries in the payload
print(escape2020['hits'].keys())
print(escape2020['hits']['total'])

dict_keys(['hits', 'total'])
13
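
With 13 records and `size=100`, a single request returns everything here. For a larger community the results would span several pages. A minimal sketch of how the pages could be walked, assuming the endpoint accepts a `page` parameter alongside `size`; `iter_records` and `fetch` are hypothetical names, and `fetch` is any callable that turns a parameter dict into the decoded JSON payload.

```python
def iter_records(base_params, fetch, size=100):
    """Yield every record of a search, requesting pages until 'total' is reached."""
    page, seen = 1, 0
    while True:
        payload = fetch({**base_params, "size": size, "page": page})
        hits = payload["hits"]["hits"]
        if not hits:  # defensive stop if the server returns an empty page
            return
        yield from hits
        seen += len(hits)
        if seen >= payload["hits"]["total"]:
            return
        page += 1


# Offline demonstration with a fake fetch function serving 5 dummy records.
def fake_fetch(params):
    data = [{"id": i} for i in range(5)]
    start = (params["page"] - 1) * params["size"]
    return {"hits": {"hits": data[start:start + params["size"]],
                     "total": len(data)}}


print([r["id"] for r in iter_records({"communities": "escape2020"},
                                     fake_fetch, size=2)])
```

Against the real API, `fetch` could be `lambda p: requests.get('https://zenodo.org/api/records', params=p).json()`, with `base_params` set to the `parameters` dict defined earlier.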


In [8]:
all_entries = escape2020['hits']['hits']

In [9]:
# The content of the first entry of the payload - It contains all the info that we can also find in Zenodo
all_entries[0]

{'conceptdoi': '10.5281/zenodo.3572654',
 'conceptrecid': '3572654',
 'created': '2021-05-25T14:00:15.944774+00:00',
 'doi': '10.5281/zenodo.4790629',
 'files': [{'bucket': '923a2614-a0fa-4927-bb3b-704168f3c768',
   'checksum': 'md5:9787677bb5b63f86459dbdd230b74d0f',
   'key': 'codemeta.json',
   'links': {'self': 'https://zenodo.org/api/files/923a2614-a0fa-4927-bb3b-704168f3c768/codemeta.json'},
   'size': 3714,
   'type': 'json'},
  {'bucket': '923a2614-a0fa-4927-bb3b-704168f3c768',
   'checksum': 'md5:5cba809c8d787b362eaaf9e0529cf09b',
   'key': 'Singularity',
   'links': {'self': 'https://zenodo.org/api/files/923a2614-a0fa-4927-bb3b-704168f3c768/Singularity'},
   'size': 1441,
   'type': ''},
  {'bucket': '923a2614-a0fa-4927-bb3b-704168f3c768',
   'checksum': 'md5:6b992ab60974a8360550c5fe5cdb5239',
   'key': 'Singularity.simg',
   'links': {'self': 'https://zenodo.org/api/files/923a2614-a0fa-4927-bb3b-704168f3c768/Singularity.simg'},
   'size': 283996191,
   'type': 'simg'},
  {'bu

In [10]:
# Example to retrieve the id and title of each entry
for entry in all_entries:
    print(f"{entry['id']} \t {entry['metadata']['title']}")

4790629 	 ESCAPE template project
4786641 	 ZenodoCI
4687123 	 cosimoNigro/agnpy: v0.0.10: added EPWL for electrons and off-axis absorption calculation
4601451 	 gLike: numerical maximization of heterogeneous joint likelihood functions of a common free parameter plus nuisance parameters
4419866 	 IndexedConv/IndexedConv: v1.3
4044010 	 EOSC - a tool for enabling Open Science in Europe
3854976 	 FairRootGroup/DDS
3743489 	 ESCAPE the maze
3675081 	 ESFRI cluster projects - Position papers on expectations and planned contributions to the EOSC
3659184 	 ctapipe_io_mchdf5
3614662 	 FairRoot
3362435 	 FairMQ
3356656 	 A prototype for a real time pipeline for the detection of transient signals and their automatic classification


In [11]:
# Example printing the keywords of each entry (entries without keywords are skipped)
for entry in all_entries:
    try:
        print(f"{entry['id']} \t {entry['metadata']['keywords']}")
    except KeyError:
        pass

4790629 	 ['ESCAPE']
4786641 	 ['ESCAPE']
4419866 	 ['CTA']
4044010 	 ['European Open Science Cloud, ESFRI, e-Infrastructures']
3743489 	 ['ESCAPE']
3659184 	 ['CTA']
3614662 	 ['geant4', 'c-plus-plus', 'cmake', 'reconstruction', 'vmc', 'modular', 'analysis', 'simulation']
3356656 	 ['Machine Learning, Big Data, Aapche Kafka, Gravitational Wave']
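
The output above shows that some records store several keywords in a single comma-separated string (for example entry 4044010), while others use one list item per keyword. A small sketch that normalizes both cases; `split_keywords` is a hypothetical helper, not part of any Zenodo library.

```python
def split_keywords(keywords):
    """Split comma-separated keyword strings into individual, stripped keywords."""
    return [part.strip()
            for keyword in keywords
            for part in keyword.split(",")
            if part.strip()]


print(split_keywords(['European Open Science Cloud, ESFRI, e-Infrastructures']))
print(split_keywords(['CTA']))
```

Inside the loop, `entry['metadata'].get('keywords', [])` could be passed to this helper, which also avoids the `try`/`except KeyError` for entries without keywords.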


#### Let's explore a specific ESCAPE2020 entry, for example `agnpy`.

In [12]:
agnpy = requests.get('https://zenodo.org/api/records/4687123', params=parameters).json()
agnpy.keys()

dict_keys(['conceptdoi', 'conceptrecid', 'created', 'doi', 'files', 'id', 'links', 'metadata', 'owners', 'revision', 'stats', 'updated'])

In [13]:
agnpy['metadata']

{'access_right': 'open',
 'access_right_category': 'success',
 'communities': [{'id': 'escape2020'}],
 'creators': [{'affiliation': "Institut de Física d'Altes Energies (IFAE)",
   'name': 'Cosimo Nigro'},
  {'name': 'Julian Sitarek'},
  {'affiliation': 'Minnesota State University Moorhead', 'name': 'Matt Craig'},
  {'name': 'Paweł Gliwny'},
  {'affiliation': '@sourcery-ai', 'name': 'Sourcery AI'}],
 'description': '<p>In this release the major features added are:</p>\n<ul>\n<li><p>an exponential cutoff power-law for the electron spectra;</p>\n</li>\n<li><p>the possibility to compute the gamma-gamma opacity for misaligned sources (<code>viewing angle != 0</code>) for the following targets: point source behind the jet, BLR and the DT.</p>\n</li>\n</ul>',
 'doi': '10.5281/zenodo.4687123',
 'license': {'id': 'other-open'},
 'publication_date': '2021-04-14',
 'related_identifiers': [{'identifier': 'https://github.com/cosimoNigro/agnpy/tree/v0.0.10',
   'relation': 'isSupplementTo',
   'sch

In [14]:
for file in agnpy['files']:
    print(file['links']['self'])

https://zenodo.org/api/files/a806b549-922e-4025-9453-a5f4c0913fdd/cosimoNigro/agnpy-v0.0.10.zip


We could run a simple `wget` on the previous URL to recover the file uploaded to Zenodo.

Let's see an example with several uploaded files.

In [15]:
ESCAPE_template = requests.get('https://zenodo.org/api/records/4790629', params=parameters).json()

In [16]:
for file in ESCAPE_template['files']:
    print(file['links']['self'])

https://zenodo.org/api/files/923a2614-a0fa-4927-bb3b-704168f3c768/codemeta.json
https://zenodo.org/api/files/923a2614-a0fa-4927-bb3b-704168f3c768/Singularity
https://zenodo.org/api/files/923a2614-a0fa-4927-bb3b-704168f3c768/Singularity.simg
https://zenodo.org/api/files/923a2614-a0fa-4927-bb3b-704168f3c768/template_project_escape-v2.1.zip
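
The file links can also be fetched from Python, and each file entry in the payload carries a checksum of the form `md5:<hex digest>` that can be used to verify the download. A minimal sketch under those assumptions; `md5_matches`, `download_files` and the `fetch` callable are hypothetical names introduced for illustration.

```python
import hashlib


def md5_matches(content, checksum):
    """Check bytes against a checksum string such as 'md5:9787677b...'."""
    algorithm, _, expected = checksum.partition(":")
    return algorithm == "md5" and hashlib.md5(content).hexdigest() == expected


def download_files(record, fetch):
    """Return {file key: bytes} for each file of a record, verifying checksums.

    `fetch` is any callable mapping a URL to bytes.
    """
    downloaded = {}
    for file in record["files"]:
        content = fetch(file["links"]["self"])
        if not md5_matches(content, file["checksum"]):
            raise ValueError(f"checksum mismatch for {file['key']}")
        downloaded[file["key"]] = content
    return downloaded
```

With `requests`, `fetch` could be `lambda url: requests.get(url, params=parameters).content`, called as `download_files(ESCAPE_template, fetch)`. This keeps each file in memory, so large files such as `Singularity.simg` would be better streamed to disk.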


## ZenodoCI

All these methods are implemented in the ZenodoCI library (https://gitlab.in2p3.fr/escape2020/wp3/zenodoci), a REST API handler for Zenodo. 

The library also automates a project's uploads from GitLab to Zenodo (using GitLab CI and the REST API handler).

In [17]:
# pip install https://gitlab.in2p3.fr/escape2020/wp3/zenodoci/-/archive/master/zenodoci-master.zip

In [18]:
from zenodoci.zenodoapi import ZenodoAPI
z = ZenodoAPI(access_token=token, sandbox=False)

In [19]:
entries = z.fetch_community_entries(community_name='escape2020', 
                                    results_per_query=100)
entries.json()['hits']['total']

13

In [20]:
ids = z.fetch_community_entries_per_id(community_name='escape2020', 
                                       results_per_query=100)

titles = z.fetch_community_entries_per_title(community_name='escape2020', 
                                            results_per_query=100)

for id, title in zip(ids, titles):
    print(id, title)

4790629 ESCAPE template project
4786641 ZenodoCI
4687123 cosimoNigro/agnpy: v0.0.10: added EPWL for electrons and off-axis absorption calculation
4601451 gLike: numerical maximization of heterogeneous joint likelihood functions of a common free parameter plus nuisance parameters
4419866 IndexedConv/IndexedConv: v1.3
4044010 EOSC - a tool for enabling Open Science in Europe
3854976 FairRootGroup/DDS
3743489 ESCAPE the maze
3675081 ESFRI cluster projects - Position papers on expectations and planned contributions to the EOSC
3659184 ctapipe_io_mchdf5
3614662 FairRoot
3362435 FairMQ
3356656 A prototype for a real time pipeline for the detection of transient signals and their automatic classification


## PyZenodo3

Another equivalent example, this time with the `pyzenodo3` library.

In [21]:
# pip install pyzenodo3

In [22]:
import pyzenodo3

zen = pyzenodo3.Zenodo()
records = zen.search('agnpy')

In [23]:
records[0].data

{'conceptdoi': '10.5281/zenodo.4055175',
 'conceptrecid': '4055175',
 'created': '2021-04-14T13:49:58.312657+00:00',
 'doi': '10.5281/zenodo.4687123',
 'files': [{'bucket': 'a806b549-922e-4025-9453-a5f4c0913fdd',
   'checksum': 'md5:21e85ec3ec312b54e45b39998b4dfa4b',
   'key': 'cosimoNigro/agnpy-v0.0.10.zip',
   'links': {'self': 'https://zenodo.org/api/files/a806b549-922e-4025-9453-a5f4c0913fdd/cosimoNigro/agnpy-v0.0.10.zip'},
   'size': 4662766,
   'type': 'zip'}],
 'id': 4687123,
 'links': {'badge': 'https://zenodo.org/badge/doi/10.5281/zenodo.4687123.svg',
  'bucket': 'https://zenodo.org/api/files/a806b549-922e-4025-9453-a5f4c0913fdd',
  'conceptbadge': 'https://zenodo.org/badge/doi/10.5281/zenodo.4055175.svg',
  'conceptdoi': 'https://doi.org/10.5281/zenodo.4055175',
  'doi': 'https://doi.org/10.5281/zenodo.4687123',
  'html': 'https://zenodo.org/record/4687123',
  'latest': 'https://zenodo.org/api/records/4687123',
  'latest_html': 'https://zenodo.org/record/4687123',
  'self': '