# Dataverse search API

By studying the code examples in Alessandra's previous notebook and Laura's `curl` command (see below), we can now try to use Python to query and download data from the DCCD database.

I also had a look at the documentation provided by Dataverse itself, but it is somewhat outdated: https://guides.dataverse.org/en/6.0/api/search.html

In [2]:
import requests
import zipfile
import io
import os 
import json

We will use the well-established Python package `requests` to explore the DCCD database. Let's start with a very basic `get()` request and take a look at the response. We are asking for only the first 2 items in the DCCD database.

In [3]:
response = requests.get('https://dataverse.nl/api/search?q=subtree=dccd&start=0&per_page=2')

We can study the response by looking at its `status_code` attribute.

In [7]:
response.status_code 

200

We see here that `status_code == 200` which means `OK`. 

The actual (metadata) content of the response can be inspected with the `.json()` method.

In [6]:
response.json()

{'status': 'OK',
 'data': {'q': 'subtree=dccd',
  'total_count': 4000,
  'start': 0,
  'spelling_alternatives': {},
  'items': [{'name': 'DCCD',
    'type': 'dataverse',
    'url': 'https://dataverse.nl/dataverse/dccd',
    'identifier': 'dccd',
    'description': 'Digital Collaboratory for Cultural Dendrochronology (DCCD) For more information about Dendrochronological data in DataverseNL, please visit this page. Dendrochronology studies the annually varying ring widths in wood. Tree-ring patterns in wood from the cultural heritage contain unique information about former chronology, social economy, the historical landscape and its uses, climate and wood technology. The DCCD is a digital repository and interactive library of tree-ring data. Its content is developed through research of, among others: archaeological sites (including old landscapes), ship wrecks, buildings, furniture, paintings, sculptures and musical instruments. The DCCD is based on the Tree-Ring Data Standard (TRiDaS) a

Browsing through this result, several interesting pieces of information show up. First of all, the `'total_count'` of items in the DCCD database is 4000. The next thing to notice is that the response can contain different types of data items. In this case the first item is of the type `'dataverse'`, and the second item is of the type `'dataset'`. However, this second item does not seem to be an actual dendrochronological dataset, but a publication.
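
To get a quick overview of which item types a search result contains, the items can be tallied by their `'type'` field. The sketch below is only an illustration: `count_item_types` and the `sample` dictionary are made up here, and `sample` is a hand-made stand-in for `response.json()` reduced to the fields that are used.

```python
from collections import Counter


def count_item_types(search_json):
    """Count how many items of each 'type' a search response contains."""
    items = search_json['data']['items']
    return Counter(item['type'] for item in items)


# Hand-made stand-in for response.json(); the second name is hypothetical
sample = {
    'data': {
        'items': [
            {'name': 'DCCD', 'type': 'dataverse'},
            {'name': 'hypothetical dataset', 'type': 'dataset'},
        ]
    }
}

print(count_item_types(sample))
```

With a live response this would be called as `count_item_types(response.json())`.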

If we copy the url of this second item into a web browser we are redirected to this page: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/UONJRZ

**At the bottom of this page we see a table with the descriptions of three files. However, access to these files appears to be restricted. This is a piece of information that does not seem to appear in the json code above. It is unclear how we can check in advance which datasets are actually open.**  
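
One possible way to check this in advance is to look at the per-file `'restricted'` flag. This flag does appear in the dataset metadata that is retrieved with `pyDataverse` further down in this notebook (under `['data']['latestVersion']['files']`). The sketch below assumes that layout; the helper names and the third, restricted file are hypothetical, and other access barriers than this flag may exist.

```python
def restricted_files(dataset_json):
    """Map each file label to its 'restricted' flag."""
    files = dataset_json['data']['latestVersion']['files']
    return {f['label']: f['restricted'] for f in files}


def all_open(dataset_json):
    """True if none of the files in the dataset is flagged as restricted."""
    return not any(restricted_files(dataset_json).values())


# Stand-in for dataset.json(), reduced to the fields that are used here
sample = {
    'data': {
        'latestVersion': {
            'files': [
                {'label': '2016003 Dorestad D16 Logboek.pdf', 'restricted': False},
                {'label': '2016003 Dorestad meetreeksen 1 tot en met 10.fh', 'restricted': False},
            ]
        }
    }
}

# Hypothetical dataset with one restricted file, for comparison
sample_closed = {
    'data': {'latestVersion': {'files': [{'label': 'hypothetical.pdf', 'restricted': True}]}}
}

print(restricted_files(sample))
print(all_open(sample), all_open(sample_closed))
```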

## Testing file downloads for Stichting RING 

I have understood that the complete data for Stichting RING should be available for download. So for now, let's limit our search scope to datasets in this subtree, and build formatted strings for searching and downloading, so we can easily reuse them. The parts in curly brackets in these strings are placeholders that can be filled in with the `format()` method.

In [74]:
search_url = 'https://dataverse.nl/api/search?q=subtree=stichtingring&type=dataset&start={start}&per_page={per_page}'
download_url = 'https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}' 

In [65]:
url = search_url.format(start=0, per_page=2)
print(f'Search_url: {url}')

Search_url: https://dataverse.nl/api/search?q=subtree=stichtingring&type=dataset&start=0&per_page=2
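
Note that in this url everything after `q=` up to the first `&` is the search term, so the term that is sent is literally `subtree=stichtingring`. Laura's `curl` command further below passes `subtree` as a separate parameter instead. A minimal sketch of that variant is given here, assuming that `q=*` matches all records; `build_search_url` is a hypothetical helper, and the total count returned for this query may differ from the numbers in this notebook.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://dataverse.nl/api/search'


def build_search_url(subtree, start=0, per_page=10, query='*', item_type='dataset'):
    """Build a search url with 'subtree' as its own query parameter."""
    params = {
        'q': query,            # assumption: '*' matches everything
        'subtree': subtree,
        'type': item_type,
        'start': start,
        'per_page': per_page,
    }
    # urlencode takes care of percent-encoding special characters
    return f'{SEARCH_ENDPOINT}?{urlencode(params)}'


url = build_search_url('stichtingring', start=0, per_page=2)
print(f'Search_url: {url}')
```

The resulting url can be passed to `requests.get()` in the same way as before.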


In [66]:
response = requests.get(url,  stream=True)
response.status_code

200

We can again inspect the JSON-encoded metadata that is returned for our search query:

In [68]:
response.json()

{'status': 'OK',
 'data': {'q': 'subtree=stichtingring',
  'total_count': 2366,
  'start': 0,
  'spelling_alternatives': {},
  'items': [{'name': 'Waney-edge based felling dates in 1250-1700 CE established through dendrochronological research of archaeological and built heritage in The Netherlands',
    'type': 'dataset',
    'url': 'https://doi.org/10.34894/ZWBVSW',
    'global_id': 'doi:10.34894/ZWBVSW',
    'description': "Waney-edge based felling dates in 1250-1700 CE established through dendrochronological research of archaeological and built heritage in The Netherlands. Further documentation of the research projects used for this overview can be found at https://dataverse.nl/dataverse/stichtingring by querying for the Laboratory Codes of individual measurement series (e.g. 'https://dataverse.nl/dataverse/stichtingring/?q=ikar0071')",
    'published_at': '2021-11-17T10:13:11Z',
    'publisher': 'Stichting RING',
    'citationHtml': 'Jansma, Esther, 2021, "Waney-edge based felling 

If we now want to check whether we can actually download the files that belong to these datasets, we need to extract the bare DOI codes.

In [73]:
doi_list = [item['url'].replace('https://doi.org/', '') for item in response.json()['data']['items']]
doi_list

['10.34894/ZWBVSW', '10.34894/GE6CZ2']
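
As an alternative, the DOI can also be taken from the `'global_id'` field, which in the response above has the form `doi:10.34894/ZWBVSW`. The sketch below strips that prefix; `bare_doi` is a hypothetical helper, and it assumes that every dataset item carries a `'global_id'` starting with `doi:`.

```python
def bare_doi(item):
    """Return the DOI of a search item without the 'doi:' prefix."""
    global_id = item['global_id']
    prefix = 'doi:'
    if not global_id.startswith(prefix):
        # Fail loudly instead of silently returning something that is not a DOI
        raise ValueError(f'Not a DOI: {global_id!r}')
    return global_id[len(prefix):]


# Stand-in for one entry of response.json()['data']['items']
item = {'url': 'https://doi.org/10.34894/ZWBVSW', 'global_id': 'doi:10.34894/ZWBVSW'}
print(bare_doi(item))
```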

The Python code below requests each dataset as a zip archive and lists the files that it contains.

In [77]:
for doi in doi_list: 
    url = download_url.format(persistent_id=doi)
    print()
    print(f'Download url: {url}')
    response = requests.get(url, stream=True)

    if response.status_code == 200:
        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
            print('Found these files: ')
            print(zip_file.namelist())
    else: 
        print(f'Status_code={response.status_code}. Could not download files as zip. ')


Download url: https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:10.34894/ZWBVSW
Found these files: 
['Jansma 2021_Waney-edge based felling dates in 1250-1700 CE.xlsx', 'MANIFEST.TXT']

Download url: https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:10.34894/GE6CZ2
Found these files: 
['Logbook 1998008 AVG.pdf', 'MANIFEST.TXT']


**Conclusion so far is that it should be straightforward to download all data files from Stichting RING.**
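
To reach all 2366 datasets, the search has to be repeated page by page using the `start` and `per_page` parameters. The sketch below shows one way to do this. It is not tested against the server: `iter_items` and `fetch_page` are hypothetical names, and `fake_fetch` only imitates the shape of the search response with made-up identifiers.

```python
def iter_items(fetch_page, per_page=100):
    """Yield every search item, requesting one page at a time.

    fetch_page(start, per_page) is a hypothetical callable that returns
    the parsed JSON of a single search request.
    """
    start = 0
    while True:
        data = fetch_page(start, per_page)['data']
        yield from data['items']
        start += per_page
        # Stop when all items have been seen, or when a page comes back empty
        if start >= data['total_count'] or not data['items']:
            break


def fake_fetch(start, per_page, total=5):
    """Imitate the search API with made-up identifiers (no network access)."""
    ids = [f'doi:10.0000/FAKE{i}' for i in range(total)]
    page = ids[start:start + per_page]
    return {'data': {'total_count': total, 'items': [{'global_id': g} for g in page]}}


dois = [item['global_id'] for item in iter_items(fake_fetch, per_page=2)]
print(dois)
```

For real use, `fetch_page` could be something like `lambda start, per_page: requests.get(search_url.format(start=start, per_page=per_page)).json()`.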

## Downloading with the `curl` command line tool 

As explained by Laura, all dendro records with an end date between, say, 1400 and 1800 can be queried with the `curl` terminal command below. The JSON output is formatted with `jq` and captured in the file `output.txt`.

    $ curl "https://dataverse.nl/api/search?q=dccd-periodEnd%3A%5B1400%20TO%201800%5D&start=0&per_page=100&subtree=dccd&type=dataset&metadata_fields=dccd:*" | jq > output.txt 
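
The same query can be put together in Python. The sketch below only builds the url from the parameters of the `curl` command above; `period_end_search_url` is a hypothetical helper. The percent-encoding of `metadata_fields` comes out slightly different from the hand-written command, but it should decode to the same value on the server.

```python
from urllib.parse import quote, urlencode


def period_end_search_url(first_year, last_year, start=0, per_page=100):
    """Build the search url for DCCD datasets with periodEnd in a year range."""
    params = {
        'q': f'dccd-periodEnd:[{first_year} TO {last_year}]',
        'start': start,
        'per_page': per_page,
        'subtree': 'dccd',
        'type': 'dataset',
        'metadata_fields': 'dccd:*',
    }
    # quote_via=quote encodes spaces as %20, as in the curl command
    return 'https://dataverse.nl/api/search?' + urlencode(params, quote_via=quote)


url = period_end_search_url(1400, 1800)
print(url)
```

The JSON result can then be obtained with `requests.get(url).json()`.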

## How about `easyDataverse`?

**This is interesting! However, I do not see how to download all metadata this way.**

See: https://github.com/gdcc/easyDataverse and https://github.com/gdcc/easyDataverse/blob/main/examples/EasyDataverseBasics.ipynb

%pip install easyDataverse

In [45]:
from easyDataverse import Dataverse

In [52]:
#dataverse = Dataverse(server_url='https://demo.dataverse.org')
dataverse = Dataverse(server_url='https://dataverse.nl/')

In [55]:
dataverse.list_metadatablocks(detailed=False)

geospatial
socialscience
astrophysics
biomedical
journal
citation
comrades-dcl-metadata
dccd
dansDataVaultMetadata


In [56]:
dataset = dataverse.create_dataset()

print(dataset) # Should be empty by now

metadatablocks: {}



In [57]:
dataset.dccd.info()

## How about the `pyDataverse` Python package?

**This seems outdated and no longer supported.**

Let's also try to use the `pyDataverse` package for downloading data. 

See: https://pydataverse.readthedocs.io/en/latest/user/basic-usage.html#download-and-save-a-dataset-to-disk

%pip install -U pyDataverse

In [2]:
from pyDataverse.api import NativeApi, DataAccessApi
from pyDataverse.models import Dataverse 

In [22]:
#base_url = 'https://dataverse.harvard.ed'
base_url = 'https://dataverse.nl'
api = NativeApi(base_url, api_version='v1')
data_api = DataAccessApi(base_url)

In [23]:
#DOI = "doi:10.7910/DVN/KBHLOD"
DOI = 'doi:10.34894/MSBW8A'
dataset = api.get_dataset(DOI)


In [24]:
files_list = dataset.json()['data']['latestVersion']['files']

In [25]:
files_list

[{'description': 'Lab logbook',
  'label': '2016003 Dorestad D16 Logboek.pdf',
  'restricted': False,
  'version': 1,
  'datasetVersionId': 29311,
  'dataFile': {'id': 376393,
   'persistentId': '',
   'filename': '2016003 Dorestad D16 Logboek.pdf',
   'contentType': 'application/pdf',
   'friendlyType': 'Adobe PDF',
   'filesize': 2951413,
   'description': 'Lab logbook',
   'storageIdentifier': 'file://1894e4bf037-544cfdd49456',
   'rootDataFileId': -1,
   'md5': '3ceeb18fe19526d0823e4128c805f14c',
   'checksum': {'type': 'MD5', 'value': '3ceeb18fe19526d0823e4128c805f14c'},
   'tabularData': False,
   'creationDate': '2023-07-13',
   'publicationDate': '2023-07-13',
   'fileAccessRequest': False}},
 {'description': 'Measurement series in stacked Heidelberg format',
  'label': '2016003 Dorestad meetreeksen 1 tot en met 10.fh',
  'restricted': False,
  'version': 1,
  'datasetVersionId': 29311,
  'dataFile': {'id': 376394,
   'persistentId': '',
   'filename': '2016003 Dorestad meetree

In [15]:
for file in files_list:
    filename = file["dataFile"]["filename"]
    file_id = file["dataFile"]["id"]
    print("File name {}, id {}".format(filename, file_id))
    response = data_api.get_datafile(file_id)
    with open(filename, "wb") as f:
        f.write(response.content)

File name 2016003 Dorestad D16 Logboek.pdf, id 376393
File name 2016003 Dorestad meetreeksen 1 tot en met 10.fh, id 376394


Almost there. Unfortunately, the saved files are just text files that contain an error message. The `requestUrl` in the message shows that the numeric file id was sent as a `persistentId`:

    {"status":"ERROR","code":404,"message":"API endpoint does not exist on this server. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"https://dataverse.nl/api/v1/access/datafile/:persistentId/?persistentId=376393","requestMethod":"GET"}
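
A possible workaround is to skip `pyDataverse` for this step and request each file by its numeric id directly. The sketch below assumes that the Dataverse data access endpoint `/api/access/datafile/{id}` is available on dataverse.nl, which is not verified here; `datafile_url` is a hypothetical helper.

```python
def datafile_url(file_id, base_url='https://dataverse.nl'):
    """Build the download url for a single file from its numeric id."""
    return f'{base_url}/api/access/datafile/{file_id}'


# File ids taken from the files_list output above
file_ids = {
    '2016003 Dorestad D16 Logboek.pdf': 376393,
    '2016003 Dorestad meetreeksen 1 tot en met 10.fh': 376394,
}

urls = {name: datafile_url(file_id) for name, file_id in file_ids.items()}
for name, url in urls.items():
    print(f'{name}: {url}')
```

Each url can then be fetched with `requests.get(url)`. Checking `response.status_code == 200` before writing `response.content` to disk avoids saving an error message under the name of a data file.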

In [27]:
#data_api??