# Dataverse search API

*As a next step, we need to learn how to query the complete DCCD dataverse...* 

## Listing all records

As Laura explained, all dendro records whose `dccd-periodEnd` falls between, say, 1400 and 1800 can be queried with the `curl` command below. The JSON output is piped through `jq` and saved to the file `output.txt`.

    $ curl "https://dataverse.nl/api/search?q=dccd-periodEnd%3A%5B1400%20TO%201800%5D&start=0&per_page=100&subtree=dccd&type=dataset&metadata_fields=dccd:*" | jq > output.txt 

This works well and returns metadata for the first 100 records. An alternative to the `curl` command-line tool might be to call the Search API from Python.
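
As a rough sketch of what that could look like, the same query can be assembled in Python with the standard library's `urllib.parse.urlencode`, which takes care of the percent-encoding that is written out by hand in the `curl` command. The helper name `build_search_url` is made up for illustration; the parameter names and values are copied from the command above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://dataverse.nl/api/search"


def build_search_url(query, start=0, per_page=100):
    """Build a Search API URL for datasets in the `dccd` collection.

    Hypothetical helper; the parameters mirror the curl command above.
    """
    params = {
        "q": query,
        "start": start,
        "per_page": per_page,
        "subtree": "dccd",
        "type": "dataset",
        "metadata_fields": "dccd:*",
    }
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url("dccd-periodEnd:[1400 TO 1800]")
print(url)
```

Unlike the hand-written command, this also percent-encodes the `:` and `*` in `metadata_fields`; under standard URL decoding both forms carry the same value.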

Let's explore the Search API: https://guides.dataverse.org/en/6.0/api/search.html. At the bottom of this page we find an iteration example: 

```python
#!/usr/bin/env python
import urllib2
import json
base = 'https://demo.dataverse.org'
rows = 10
start = 0
page = 1
condition = True # emulate do-while
while (condition):
    url = base + '/api/search?q=*' + "&start=" + str(start)
    data = json.load(urllib2.urlopen(url))
    total = data['data']['total_count']
    print "=== Page", page, "==="
    print "start:", start, " total:", total
    for i in data['data']['items']:
        print "- ", i['name'], "(" + i['type'] + ")"
    start = start + rows
    page += 1
    condition = start < total
```

**Unfortunately, this example is written for Python 2 (`urllib2`, `print` statements) and does not run under Python 3. So instead, let's try to build the query with the code from Alessandra.**

Here is the json output for the first 5 datasets from the `dccd` subtree. 

In [30]:
import requests
import zipfile
import io
import os 
import json

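# Request the first 5 results; `start` and `per_page` must be defined before the URL is built.
start = 0
per_page = 5

# NOTE: everything between `q=` and the first `&` is the query string, so this
# searches for the literal text "subtree=dccd" (see the 'q' field echoed in the
# response below). To restrict the search to the collection, `subtree=dccd`
# would have to be a separate parameter, as in the curl command above.
url = f'https://dataverse.nl/api/search?q=subtree=dccd&type=dataset&start={start}&per_page={per_page}'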

#params = {"persistent_id": persistent_id}
#response = requests.get(url, params=params, stream=True)
response = requests.get(url, stream=True)

In [31]:
response.status_code

200

In [33]:
records = response.json()
records

{'status': 'OK',
 'data': {'q': 'subtree=dccd',
  'total_count': 3648,
  'start': 0,
  'spelling_alternatives': {},
  'items': [{'name': 'Towards an international research and data infrastructure for dendrochronology',
    'type': 'dataset',
    'url': 'https://doi.org/10.34894/UONJRZ',
    'global_id': 'doi:10.34894/UONJRZ',
    'description': 'Dendrochronological research project',
    'published_at': '2021-09-29T07:39:46Z',
    'publisher': 'DCCD internationalization project NWO 2010-2013',
    'citationHtml': 'Jansma, E., 2013, "Towards an international research and data infrastructure for dendrochronology", <a href="https://doi.org/10.34894/UONJRZ" target="_blank">https://doi.org/10.34894/UONJRZ</a>, DataverseNL, V1',
    'identifier_of_dataverse': 'dccdintprojnwo',
    'name_of_dataverse': 'DCCD internationalization project NWO 2010-2013',
    'citation': 'Jansma, E., 2013, "Towards an international research and data infrastructure for dendrochronology", https://doi.org/10.34894/

We now need to extract the total number of records in order to download all pages.

In [24]:
n_records = records['data']['total_count']
n_records

3648
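
With the total known, the `start` offsets for all pages follow from simple arithmetic. A small sketch, assuming a page size of 100 as in the `curl` example (the names `n_pages` and `page_starts` are illustrative):

```python
import math

n_records = 3648   # total_count reported by the Search API above
per_page = 100     # assumed page size, as in the curl example

# Number of requests needed to cover every record (the last page is partial).
n_pages = math.ceil(n_records / per_page)

# Value of the `start` parameter for each request.
page_starts = list(range(0, n_records, per_page))

print(n_pages, page_starts[:3], page_starts[-1])
# prints: 37 [0, 100, 200] 3600
```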

The next problem to solve is how to merge the JSON batches that come back page by page.

In [37]:
start = 0 
per_page = 2
search_url = f'https://dataverse.nl/api/search?q=subtree=dccd&start={start}&per_page={per_page}'

In [38]:
response = requests.get(search_url, stream=True)

In [39]:
response.json()

{'status': 'OK',
 'data': {'q': 'subtree=dccd',
  'total_count': 4000,
  'start': 0,
  'spelling_alternatives': {},
  'items': [{'name': 'DCCD',
    'type': 'dataverse',
    'url': 'https://dataverse.nl/dataverse/dccd',
    'identifier': 'dccd',
    'description': 'Digital Collaboratory for Cultural Dendrochronology (DCCD) For more information about Dendrochronological data in DataverseNL, please visit this page. Dendrochronology studies the annually varying ring widths in wood. Tree-ring patterns in wood from the cultural heritage contain unique information about former chronology, social economy, the historical landscape and its uses, climate and wood technology. The DCCD is a digital repository and interactive library of tree-ring data. Its content is developed through research of, among others: archaeological sites (including old landscapes), ship wrecks, buildings, furniture, paintings, sculptures and musical instruments. The DCCD is based on the Tree-Ring Data Standard (TRiDaS) a

In [35]:
# def bulk_download_dccd(): 



# step 1: determine total count from first item 
start = 0
per_page = 1

#url = f'https://dataverse.nl/api/search?q=subtree=dccd&type=dataset&start={start}&per_page={per_page}' 
url = f'https://dataverse.nl/api/search?q=subtree=dccd&type=dataset&start={start}&per_page={per_page}'

response_json = requests.get(url, stream=True).json()
response_json



{'status': 'OK',
 'data': {'q': 'subtree=dccd',
  'total_count': 3648,
  'start': 0,
  'spelling_alternatives': {},
  'items': [{'name': 'Towards an international research and data infrastructure for dendrochronology',
    'type': 'dataset',
    'url': 'https://doi.org/10.34894/UONJRZ',
    'global_id': 'doi:10.34894/UONJRZ',
    'description': 'Dendrochronological research project',
    'published_at': '2021-09-29T07:39:46Z',
    'publisher': 'DCCD internationalization project NWO 2010-2013',
    'citationHtml': 'Jansma, E., 2013, "Towards an international research and data infrastructure for dendrochronology", <a href="https://doi.org/10.34894/UONJRZ" target="_blank">https://doi.org/10.34894/UONJRZ</a>, DataverseNL, V1',
    'identifier_of_dataverse': 'dccdintprojnwo',
    'name_of_dataverse': 'DCCD internationalization project NWO 2010-2013',
    'citation': 'Jansma, E., 2013, "Towards an international research and data infrastructure for dendrochronology", https://doi.org/10.34894/
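
One possible way to merge the batches is to keep requesting pages and append each page's `items` to a single list until `total_count` is reached. The sketch below is only an outline under a few assumptions: the function names `fetch_page` and `collect_all_items` are made up here, the query uses `q=*` with `subtree=dccd` as a separate parameter (as in the `curl` command), and the standard library's `urllib` is used instead of `requests` so that the block has no third-party dependencies. The fetch function is passed in as an argument so that the merging logic can be tried without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://dataverse.nl/api/search"


def fetch_page(start, per_page):
    """Request one page of dataset records and return the decoded JSON."""
    params = {"q": "*", "subtree": "dccd", "type": "dataset",
              "start": start, "per_page": per_page}
    with urlopen(SEARCH_URL + "?" + urlencode(params)) as response:
        return json.load(response)


def collect_all_items(fetch=fetch_page, per_page=100):
    """Merge the `items` lists of all pages into one list."""
    items = []
    start = 0
    while True:
        data = fetch(start, per_page)["data"]
        items.extend(data["items"])
        start += per_page
        # Stop at the reported total, or if a page comes back empty.
        if start >= data["total_count"] or not data["items"]:
            break
    return items


# Offline check with a stand-in for the server that holds 5 fake records.
def fake_fetch(start, per_page):
    records = [{"name": f"record {i}"} for i in range(5)]
    return {"data": {"total_count": len(records),
                     "items": records[start:start + per_page]}}


merged = collect_all_items(fetch=fake_fetch, per_page=2)
print(len(merged))
```

Calling `collect_all_items()` with no arguments would send the requests to dataverse.nl instead of the stand-in.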

## Parsing JSON and extracting persistent IDs

Let's see how easy it is to extract the persistent IDs from these records...

In [17]:
urls = [record['url'] for record in records['data']['items']]
urls

['https://doi.org/10.34894/UONJRZ',
 'https://doi.org/10.34894/CKVQTX',
 'https://doi.org/10.34894/PFVS2M',
 'https://doi.org/10.34894/WEVUWM',
 'https://doi.org/10.34894/08BFJQ']
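
The list above holds resolver URLs rather than persistent IDs as such. Going by the JSON output shown earlier, each dataset record also carries a `global_id` field of the form `doi:10.34894/UONJRZ`. A small sketch with one record copied from that output and trimmed to two fields; the second record and the fallback that strips the `https://doi.org/` prefix are assumptions, added for records that might lack `global_id`:

```python
# First record copied from the search output above, trimmed to two fields.
sample_records = {
    "data": {
        "items": [
            {"url": "https://doi.org/10.34894/UONJRZ",
             "global_id": "doi:10.34894/UONJRZ"},
            # Hypothetical record without `global_id`, to exercise the fallback.
            {"url": "https://doi.org/10.34894/CKVQTX"},
        ]
    }
}

DOI_RESOLVER = "https://doi.org/"


def persistent_id(record):
    """Return the persistent ID of a search result, e.g. 'doi:10.34894/UONJRZ'."""
    if "global_id" in record:
        return record["global_id"]
    # Fallback (assumption): rebuild the ID from the resolver URL.
    url = record["url"]
    if url.startswith(DOI_RESOLVER):
        return "doi:" + url[len(DOI_RESOLVER):]
    return url


pids = [persistent_id(r) for r in sample_records["data"]["items"]]
print(pids)
```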

## How about `easyDataverse`?

**This is interesting! However, I do not see how to download all metadata this way.**

See: https://github.com/gdcc/easyDataverse and https://github.com/gdcc/easyDataverse/blob/main/examples/EasyDataverseBasics.ipynb

%pip install easyDataverse

In [45]:
from easyDataverse import Dataverse

In [52]:
#dataverse = Dataverse(server_url='https://demo.dataverse.org')
dataverse = Dataverse(server_url='https://dataverse.nl/')

In [55]:
dataverse.list_metadatablocks(detailed=False)

geospatial
socialscience
astrophysics
biomedical
journal
citation
comrades-dcl-metadata
dccd
dansDataVaultMetadata


In [56]:
dataset = dataverse.create_dataset()

print(dataset) # Should be empty by now

metadatablocks: {}



In [57]:
dataset.dccd.info()

## How about the `pyDataverse` Python package?

**This seems outdated and no longer supported.**

Let's also try to use the `pyDataverse` package for downloading data. 

See: https://pydataverse.readthedocs.io/en/latest/user/basic-usage.html#download-and-save-a-dataset-to-disk

%pip install -U pyDataverse

In [2]:
from pyDataverse.api import NativeApi, DataAccessApi
from pyDataverse.models import Dataverse 

In [22]:
#base_url = 'https://dataverse.harvard.edu'
base_url = 'https://dataverse.nl'
api = NativeApi(base_url, api_version='v1')
data_api = DataAccessApi(base_url)

In [23]:
#DOI = "doi:10.7910/DVN/KBHLOD"
DOI = 'doi:10.34894/MSBW8A'
dataset = api.get_dataset(DOI)


In [24]:
files_list = dataset.json()['data']['latestVersion']['files']

In [25]:
files_list

[{'description': 'Lab logbook',
  'label': '2016003 Dorestad D16 Logboek.pdf',
  'restricted': False,
  'version': 1,
  'datasetVersionId': 29311,
  'dataFile': {'id': 376393,
   'persistentId': '',
   'filename': '2016003 Dorestad D16 Logboek.pdf',
   'contentType': 'application/pdf',
   'friendlyType': 'Adobe PDF',
   'filesize': 2951413,
   'description': 'Lab logbook',
   'storageIdentifier': 'file://1894e4bf037-544cfdd49456',
   'rootDataFileId': -1,
   'md5': '3ceeb18fe19526d0823e4128c805f14c',
   'checksum': {'type': 'MD5', 'value': '3ceeb18fe19526d0823e4128c805f14c'},
   'tabularData': False,
   'creationDate': '2023-07-13',
   'publicationDate': '2023-07-13',
   'fileAccessRequest': False}},
 {'description': 'Measurement series in stacked Heidelberg format',
  'label': '2016003 Dorestad meetreeksen 1 tot en met 10.fh',
  'restricted': False,
  'version': 1,
  'datasetVersionId': 29311,
  'dataFile': {'id': 376394,
   'persistentId': '',
   'filename': '2016003 Dorestad meetree

In [15]:
for file in files_list:
    filename = file["dataFile"]["filename"]
    file_id = file["dataFile"]["id"]
    print("File name {}, id {}".format(filename, file_id))
    response = data_api.get_datafile(file_id)
    with open(filename, "wb") as f:
        f.write(response.content)

File name 2016003 Dorestad D16 Logboek.pdf, id 376393
File name 2016003 Dorestad meetreeksen 1 tot en met 10.fh, id 376394


Almost there. Unfortunately, the saved files are just text files that contain an error message instead of the actual data:

    {"status":"ERROR","code":404,"message":"API endpoint does not exist on this server. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"https://dataverse.nl/api/v1/access/datafile/:persistentId/?persistentId=376393","requestMethod":"GET"}
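
The `requestUrl` in the error hints at what may have gone wrong: the numeric file id `376393` was sent as a `persistentId` query parameter, while the files in this dataset have an empty `persistentId`. The Dataverse Data Access API also accepts the numeric database id directly in the path (`/api/access/datafile/{id}`), so one possible workaround is to bypass `pyDataverse` for the download step. The sketch below assumes that this endpoint is reachable without authentication for unrestricted files; `datafile_url` and `download_datafile` are names invented for this example, and the download itself is defined but not called.

```python
import shutil
from urllib.request import urlopen


def datafile_url(base_url, file_id):
    """Build the Data Access API URL for a file's numeric database id."""
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"


def download_datafile(base_url, file_id, filename):
    """Stream one file to disk (not called here; needs network access)."""
    with urlopen(datafile_url(base_url, file_id)) as response, \
            open(filename, "wb") as f:
        shutil.copyfileobj(response, f)


print(datafile_url("https://dataverse.nl", 376393))
```

In the loop above, `download_datafile(base_url, file_id, filename)` would take the place of the `data_api.get_datafile(file_id)` call. It may also be worth inspecting `data_api.get_datafile` for an option to pass a numeric id instead of a persistent id.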

In [27]:
#data_api??