# A peek into TRIDAS.xml

*The dendro data we need does not appear to be available in the Dataverse JSON metadata that we explored in the previous two notebooks. Hopefully the actual data is in the `tridas.xml` files attached to some of the DCCD records...*

In [18]:
import requests
import zipfile
import io
import os 


SEARCH_URL = 'https://dataverse.nl/api/search?q=*&subtree=stichtingring&type=dataset&start={start}&per_page={per_page}'
DOWNLOAD_URL = 'https://dataverse.nl/api/access/dataset/:persistentId?persistentId=doi:{persistent_id}' 

def print_all_files(persistent_id, url=DOWNLOAD_URL):
    """
    Print all files in a dataset with the given persistent ID.

    :param persistent_id: The persistent ID of the dataset (the DOI without the 'doi:' prefix).
    :param url: URL template containing a '{persistent_id}' placeholder.
    """

    # The original URL by Alessandra pointed at archaeology.datastations.nl;
    # the dataverse.nl template in DOWNLOAD_URL works for our dendro data.
    url = url.format(persistent_id=persistent_id)

    # The DOI is already part of the URL, so no extra query parameters are needed.
    response = requests.get(url)

    if response.status_code == 200:
        # The API returns the whole dataset as a single zip archive; open it in memory.
        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
            print(zip_file.namelist())
    else:
        print(f"Error: {response.status_code}, {response.text}")

    print("=================================================================")


def download_all_files(persistent_id, output_path, url=DOWNLOAD_URL):
    """
    Download all files from a dataset with the given persistent ID.

    :param persistent_id: The persistent ID of the dataset (the DOI without the 'doi:' prefix).
    :param output_path: The path to the directory where the files will be saved. If the directory does not exist, it will be created.
    :param url: URL template containing a '{persistent_id}' placeholder.
    """
    url = url.format(persistent_id=persistent_id)

    # A DOI contains a '/', which cannot be used in a directory name, so replace it with '%'.
    output_doi = persistent_id.replace("/", "%")
    output_dir = f"{output_path}/{output_doi}"

    # The DOI is already part of the URL, so no extra query parameters are needed.
    response = requests.get(url)

    if response.status_code == 200:

        # The API returns the whole dataset as a single zip archive; open it in memory.
        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
            os.makedirs(output_dir, exist_ok=True)

            for file_name in zip_file.namelist():
                zip_file.extract(file_name, output_dir)
                print(f"Extracted: {file_name}")

        print(f"All files saved to '{output_dir}'")
    else:
        print(f"Error: {response.status_code}, {response.text}")

    print("=================================================================")


Here are some downloads that do not contain a `tridas.xml` file.  

In [19]:
doi_list = ['10.34894/ZWBVSW', '10.34894/GE6CZ2']

In [20]:
for doi in doi_list: 
    print_all_files(doi)

['Jansma 2021_Waney-edge based felling dates in 1250-1700 CE.xlsx', 'MANIFEST.TXT']
['Logbook 1998008 AVG.pdf', 'MANIFEST.TXT']


In [23]:
for doi in doi_list: 
    download_all_files(doi, '../../data/downloads')

Extracted: Jansma 2021_Waney-edge based felling dates in 1250-1700 CE.xlsx
Extracted: MANIFEST.TXT
All files saved to '../../data/downloads/10.34894%ZWBVSW'
Extracted: Logbook 1998008 AVG.pdf
Extracted: MANIFEST.TXT
All files saved to '../../data/downloads/10.34894%GE6CZ2'


Now, this record by Marta should contain the tridas.xml data file: 

In [24]:
tridas_doi = '10.34894/GQROG9'

In [27]:
download_all_files(tridas_doi, '../../data/tridas')

Extracted: 2010027 DateringsRapport.pdf
Extracted: associated/
Extracted: associated/2010027WBS bijlage.pdf
Extracted: originalvalues/
Extracted: originalvalues/2010027.xml
Extracted: Logbook 2010027 WBS.pdf
Extracted: tridas.xml
Extracted: originalvalues/WBS00011.fh
Extracted: originalvalues/WBS00021.fh
Extracted: MANIFEST.TXT
All files saved to '../../data/tridas/10.34894%GQROG9'
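
To tell records with and without TRiDaS data apart without reading each listing by eye, the check could be run on the list of names that `zip_file.namelist()` returns. This is a minimal sketch: `find_tridas_files` is a hypothetical helper that is not part of this notebook, and matching the file name case-insensitively is an assumption. The two listings are shortened copies of the outputs above.

```python
import posixpath


def find_tridas_files(names):
    """Return the archive member names whose base name is 'tridas.xml'.

    `names` is a list of member names as returned by ZipFile.namelist().
    The comparison ignores case (an assumption; the one record seen so far
    uses lower case).
    """
    return [name for name in names
            if posixpath.basename(name).lower() == 'tridas.xml']


# Listings copied (and shortened) from the outputs in this notebook
no_tridas = ['Logbook 1998008 AVG.pdf', 'MANIFEST.TXT']
with_tridas = ['2010027 DateringsRapport.pdf', 'originalvalues/2010027.xml',
               'tridas.xml', 'MANIFEST.TXT']

print(find_tridas_files(no_tridas))
print(find_tridas_files(with_tridas))
```

Comparing the base name rather than the full path means a `tridas.xml` inside a subfolder would also be found, while `originalvalues/2010027.xml` is not mistaken for one.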


## A `tridas.xml` file

Here is the complete content of the tridas.xml file: 

In [28]:
!cat '../../data/tridas/10.34894%GQROG9/tridas.xml'

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<tridas:project xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gml="http://www.opengis.net/gml" xmlns:tridas="http://www.tridas.org/1.2.2">
    <tridas:title>Boerderij Sieverdink Bothoekweg 9, Winterswijk-Brinkheure</tridas:title>
    <tridas:identifier domain="stichtingring.nl">P:2010027</tridas:identifier>
    <tridas:type normalStd="DCCD" normalId="1522" normal="dating" lang="en">datering</tridas:type>
    <tridas:description>houtboringen gebint</tridas:description>
    <tridas:laboratory>
        <tridas:identifier domain="www.stichtingring.nl">L1</tridas:identifier>
        <tridas:name acronym="NLRING">Stichting RING</tridas:name>
        <tridas:address>
            <tridas:addressLine1>PO Box 1600</tridas:addressLine1>
            <tridas:addressLine2>3800 BP</tridas:addressLine2>
            <tridas:cityOrTown>Amersfoort</tridas:cityOrTown>
            <tridas:country>Nederland</tridas:country>
        </tridas:address>

This raises two questions: 1) Does this file contain all the data that we need? 2) What is the best way to parse these XML values?
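
For the first question, one way to get an overview is to count which element names occur in a file. The sketch below uses only the Python standard library. The `sample` is a shortened fragment of the output above with the closing tags added by hand (an assumption, so that the snippet runs on its own), and `count_tags` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened fragment of the file shown above; closing tags added by hand
sample = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<tridas:project xmlns:tridas="http://www.tridas.org/1.2.2">
    <tridas:title>Boerderij Sieverdink Bothoekweg 9, Winterswijk-Brinkheure</tridas:title>
    <tridas:identifier domain="stichtingring.nl">P:2010027</tridas:identifier>
    <tridas:laboratory>
        <tridas:identifier domain="www.stichtingring.nl">L1</tridas:identifier>
        <tridas:name acronym="NLRING">Stichting RING</tridas:name>
    </tridas:laboratory>
</tridas:project>"""


def count_tags(root):
    """Count element names in the tree, with the '{namespace}' prefix removed."""
    return Counter(element.tag.split('}')[-1] for element in root.iter())


root = ET.fromstring(sample)
for tag, count in sorted(count_tags(root).items()):
    print(tag, count)
```

For the downloaded file, `root` could instead come from `ET.parse(path).getroot()`, with `path` pointing at the extracted `tridas.xml`.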

## How to parse this XML? 

Here are some relevant links: 

http://www.tridas.org/

http://www.tridas.org/documents/tridas.pdf

I also found things like XSD schema files. I guess I need someone to explain to me how these work. Back to school: https://www.w3schools.com/xml/
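
As a possible starting point, Python's standard `xml.etree.ElementTree` can address namespaced elements through a prefix-to-URI mapping. The sketch below assumes the namespace URI `http://www.tridas.org/1.2.2` seen in the file above. The inline `sample` is a shortened fragment with closing tags added by hand, and `read_project_summary` is a hypothetical helper, not an established TRiDaS parser.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI, copied from the xmlns:tridas attribute of the file
NS = {'tridas': 'http://www.tridas.org/1.2.2'}

# Shortened fragment of the file shown above; closing tags added by hand
sample = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<tridas:project xmlns:tridas="http://www.tridas.org/1.2.2">
    <tridas:title>Boerderij Sieverdink Bothoekweg 9, Winterswijk-Brinkheure</tridas:title>
    <tridas:identifier domain="stichtingring.nl">P:2010027</tridas:identifier>
    <tridas:type normalStd="DCCD" normalId="1522" normal="dating" lang="en">datering</tridas:type>
    <tridas:laboratory>
        <tridas:identifier domain="www.stichtingring.nl">L1</tridas:identifier>
        <tridas:name acronym="NLRING">Stichting RING</tridas:name>
    </tridas:laboratory>
</tridas:project>"""


def read_project_summary(root, ns=NS):
    """Collect a few project-level fields from a tridas:project element."""
    type_element = root.find('tridas:type', ns)
    laboratory = root.find('tridas:laboratory', ns)
    return {
        # find()/findtext() with a plain tag path only look at direct children,
        # so the laboratory's own identifier is not picked up here
        'title': root.findtext('tridas:title', namespaces=ns),
        'identifier': root.findtext('tridas:identifier', namespaces=ns),
        # the 'normal' attribute holds the normalised English term
        'type': type_element.get('normal') if type_element is not None else None,
        'laboratory': (laboratory.findtext('tridas:name', namespaces=ns)
                       if laboratory is not None else None),
    }


root = ET.fromstring(sample)
print(read_project_summary(root))
```

The XSD schema files describe which elements and attributes are allowed where; they are not needed to read values this way, but they would show which other fields to expect.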