# Getting started with dendro data from DANS dataverse

*In this Jupyter notebook I try to download and query dendrochronology data from the DANS dataverse repository. To do this I adapted the [python script](https://github.com/Dans-labs/rce-spatial-coverage/blob/master/scripts/download_ds_files.py) by Alessandra Polimeno (DANS-KNAW).* 



## A first example of downloading all files in a specific record 

To locate the persistent identifier of a dendrochronology record, I opened https://dataverse.nl/ in my web browser and entered `dccd` in the search box. This leads to a collection of datasets from a large number of organizations: https://dataverse.nl/dataverse/dccd. Some datasets seem to be restricted. Next, I selected `Stichting Ring`, whose data is fully open. This opens a page with 2459 open records: https://dataverse.nl/dataverse/stichtingring.

The very first record is this Dorestad series: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/MSBW8A. The last part of the URL, after `doi:`, is the persistent identifier: `'10.34894/MSBW8A'`.
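
Instead of copying the identifier by hand, it can also be read from the URL. This is a minimal sketch using only the standard library; `doi_from_dataset_url` is a hypothetical helper name that is not part of the adapted script, and it assumes the URL carries a `persistentId` query parameter as in the example above.

```python
from urllib.parse import urlparse, parse_qs


def doi_from_dataset_url(dataset_url):
    """Return the bare DOI (without the 'doi:' prefix) from a dataset page URL."""
    query = parse_qs(urlparse(dataset_url).query)
    persistent_id = query["persistentId"][0]  # e.g. 'doi:10.34894/MSBW8A'
    if persistent_id.startswith("doi:"):
        persistent_id = persistent_id[len("doi:"):]
    return persistent_id


print(doi_from_dataset_url(
    "https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/MSBW8A"
))  # prints 10.34894/MSBW8A
```

The returned string has the form that `print_all_files()` and `download_all_files()` expect.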

The filenames in this record can be printed with the `print_all_files()` function:

> IMPORTANT NOTE: RUN THE 'FUNCTIONS' CODE CELL AT THE END OF THIS NOTEBOOK FIRST BEFORE EXECUTING THESE EXAMPLES.

In [18]:
print_all_files('10.34894/MSBW8A')

['2016003 Dorestad D16 Logboek.pdf', '2016003 Dorestad meetreeksen 1 tot en met 10.fh', 'MANIFEST.TXT']


The files can be downloaded to a folder on your file system with the `download_all_files()` function.

In [5]:
download_all_files('10.34894/MSBW8A', '../../data/downloads')

Extracted: 2016003 Dorestad D16 Logboek.pdf
Extracted: 2016003 Dorestad meetreeksen 1 tot en met 10.fh
Extracted: MANIFEST.TXT
All files saved to '../../data/downloads/10.34894%MSBW8A'


However, if I try the DOI of the first record in the complete DCCD collection, I get this error:

In [17]:
print_all_files('10.34894/CKVQTX')

Error: 403, {"status":"ERROR","code":403,"message":"Not authorized to access this object via this API endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"https://dataverse.nl/api/v1/access/dataset/:persistentId?persistentId=doi:10.34894/CKVQTX&persistent_id=10.34894%2FCKVQTX","requestMethod":"GET"}
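
The error body is JSON, so the status code and message can be pulled out instead of printing the raw text. The sketch below is only an illustration: `summarize_api_error` is a hypothetical helper, and the sample string is a shortened copy of the response shown above. A 403 here possibly means that the record holds restricted files, but that is an assumption I have not verified.

```python
import json


def summarize_api_error(response_text):
    """Return (code, message) from a JSON error body, or (None, raw text) if it is not JSON."""
    try:
        body = json.loads(response_text)
    except json.JSONDecodeError:
        return None, response_text
    if not isinstance(body, dict):
        return None, response_text
    return body.get("code"), body.get("message", response_text)


# Shortened copy of the error body shown above
sample = (
    '{"status":"ERROR","code":403,'
    '"message":"Not authorized to access this object via this API endpoint."}'
)
code, message = summarize_api_error(sample)
print(code, message)
```

In the functions at the end of this notebook, the same helper could be applied to `response.text` in the `else` branch.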


Let's explore this further in the next notebook `02_Dataverse-search-API.ipynb`.

## FUNCTIONS

You need to run this code cell first. 

In [None]:
# This Python script was adapted by Frank Ligterink from:
# https://github.com/Dans-labs/rce-spatial-coverage/blob/master/scripts/download_ds_files.py
# The original script contains functions to inspect and download files from a dataset in the Archaeology Data Station repository.
# Author of the original script: Alessandra Polimeno (DANS-KNAW)

# The original script expects DOIs with the following structure: "10.17026/dans-xxx-xx0x"
# They can be found under the column "dsPersistentID"
# The dataverse.nl DOIs used in this notebook look like "10.34894/MSBW8A"

import requests
import zipfile
import io
import os

def print_all_files(persistent_id):
    """
    Print all files in a dataset with the given persistent ID.

    :param persistent_id: The persistent ID of the dataset. 
    """ 
    
    # this is the original url by Alessandra: 
    #url = f"http://archaeology.datastations.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}"

    # this new url works for our dendro data 
    url = f'https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}' 
    params = {"persistent_id": persistent_id}
    response = requests.get(url, params=params, stream=True)

    if response.status_code == 200:
        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
            print(zip_file.namelist())
    else:
        print(f"Error: {response.status_code}, {response.text}")

    print("=================================================================")


def download_all_files(persistent_id, output_path):
    """
    Download all files from a dataset with the given persistent ID.

    :param persistent_id: The persistent ID of the dataset.
    :param output_path: The path to the directory where the files will be saved. If the directory does not exist, it will be created.

    """
    #url = f"http://archaeology.datastations.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}"

    url = f'https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}' 
    params = {"persistent_id": persistent_id}

    output_doi = persistent_id.replace("/", "%")
    output_dir = f"{output_path}/{output_doi}"

    response = requests.get(url, params=params, stream=True)

    if response.status_code == 200:

        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
            os.makedirs(output_dir, exist_ok=True)

            for file_name in zip_file.namelist():
                zip_file.extract(file_name, output_dir)
                print(f"Extracted: {file_name}")

        print(f"All files saved to '{output_dir}'")
    else:
        print(f"Error: {response.status_code}, {response.text}")   
    
    print("=================================================================")





def download_selected_files(persistent_id, selected_files, output_path):
    """
    Download selected files from a dataset with the given persistent ID. You select the files by providing a list of filenames.
    Even if you want to download only one file, you need to provide the filename as a list.

    :param persistent_id: The persistent ID of the dataset.
    :param selected_files: A list containing the filenames to be downloaded.
    :param output_path: The path to the directory where the files will be saved. If the directory does not exist, it will be created.

    """

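    # this is the original url by Alessandra:
    #url = f"http://archaeology.datastations.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}"

    # use the dataverse.nl url here as well, consistent with the functions above
    url = f'https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}'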
    params = {"persistent_id": persistent_id}

    output_doi = persistent_id.replace("/", "%")
    output_dir = f"{output_path}/{output_doi}"

    response = requests.get(url, params=params, stream=True)

    if response.status_code == 200:

        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
            os.makedirs(output_dir, exist_ok=True)

            zip_filenames = set(zip_file.namelist())  # Get all files in the ZIP
            print(zip_filenames)
            # selected_files is documented as a list, so convert it to a set before intersecting
            found_files = set(selected_files).intersection(zip_filenames)
            # missing_files = selected_files - zip_filenames  # Files that are missing

            for file_name in found_files:
                zip_file.extract(file_name, output_dir)
                print(f"Extracted: {file_name}")

            #if missing_files:
            #    print(f"Warning: The following files were not found in the ZIP: {missing_files}")

        print(f"Selected files saved to '{output_dir}'")
    else:
        print(f"Error: {response.status_code}, {response.text}")

    print("=================================================================")


def download_specific_filetype(persistent_id, output_path, filetype): 
    """
    Download all files of a given filetype from the dataset with the specified persistent ID.

    :param persistent_id: The persistent ID of the dataset.
    :param output_path: The path to the directory where the files will be saved. If the directory does not exist, it will be created.
    :param filetype: The file type to be downloaded as a string, e.g. 'xml'

    """

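    # this is the original url by Alessandra:
    #url = f"http://archaeology.datastations.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}"

    # use the dataverse.nl url here as well, consistent with the functions above
    url = f'https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}'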
    params = {"persistent_id": persistent_id}

    output_doi = persistent_id.replace("/", "%")
    output_dir = f"{output_path}/{output_doi}"

    response = requests.get(url, params=params, stream=True)

    if response.status_code == 200:

        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
            os.makedirs(output_dir, exist_ok=True)

            zip_filenames = set(zip_file.namelist())  # Get all files in the ZIP
            print(zip_filenames)
            selected_files = {file_name for file_name in zip_filenames if file_name.endswith(filetype)}

            for file_name in selected_files:
                zip_file.extract(file_name, output_dir)
                print(f"Extracted: {file_name}")

        print(f"{filetype} files saved to '{output_dir}'")
    else:
        print(f"Error: {response.status_code}, {response.text}")

    print("=================================================================")
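
The selection logic used in `download_selected_files()` and `download_specific_filetype()` can be tried out without any network access. This is a minimal sketch, assuming a small in-memory ZIP with made-up file names; `select_members` is a hypothetical helper that is not part of the adapted script.

```python
import io
import zipfile


def select_members(zip_file, selected_files=None, filetype=None):
    """Return the sorted ZIP member names that match a list of names and/or a file suffix."""
    names = set(zip_file.namelist())
    if selected_files is not None:
        names &= set(selected_files)  # works for lists as well as sets
    if filetype is not None:
        names = {name for name in names if name.endswith(filetype)}
    return sorted(names)


# Build a small ZIP archive in memory (file names are made up for this demo)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Logboek.pdf", b"dummy")
    zf.writestr("meetreeksen.fh", b"dummy")
    zf.writestr("MANIFEST.TXT", b"dummy")

with zipfile.ZipFile(buffer) as zf:
    by_name = select_members(zf, selected_files=["MANIFEST.TXT", "missing.txt"])
    by_type = select_members(zf, filetype=".fh")

print(by_name)  # prints ['MANIFEST.TXT']
print(by_type)  # prints ['meetreeksen.fh']
```

Names that are requested but absent from the archive, such as `missing.txt` here, are silently skipped, which matches the behaviour of the intersection in `download_selected_files()`.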