# Generate multimedia extension of Pycnogonida dataset

This script retrieves image file links from the BOLD REST API for the BOLD process IDs provided.

Ideally, a full multimedia extension would be generated. However, fields such as `copyright_holder` and `media_descriptor`
are not exposed through the BOLD REST API, even though they are available in the download file on the specimen page,
e.g. https://boldsystems.org/index.php/Public_RecordView?processid=DSPYC066-11

From the download file:

```xml
<specimen_imagery>
    <media>
        <mediaID>1441700</mediaID>
        <media_descriptor>Ventral</media_descriptor>
        <copyright>
            <copyright_holder>Chester Sands</copyright_holder>
            <copyright_year>2012</copyright_year>
            <copyright_license>CreativeCommons - Attribution Non-Commercial Share-Alike</copyright_license>
            <copyright_institution>British Antarctic Survey</copyright_institution>
        </copyright>
        <photographer>Jana Dmel</photographer>
        <image_file>http://www.boldsystems.org/pics/DSPYC/_DSC1995+1318537548.JPG</image_file>
    </media>
    <media>
        <mediaID>1441704</mediaID>
        <media_descriptor>Dorsal</media_descriptor>
        <copyright>
            <copyright_holder>Chester Sands</copyright_holder>
            <copyright_year>2012</copyright_year>
            <copyright_license>CreativeCommons - Attribution Non-Commercial Share-Alike</copyright_license>
            <copyright_institution>British Antarctic Survey</copyright_institution>
        </copyright>
        <photographer>Jana Dmel</photographer>
        <image_file>http://www.boldsystems.org/pics/DSPYC/_DSC1997+1318537624.JPG</image_file>
    </media>
</specimen_imagery>
```

From the API:

http://v3.boldsystems.org/index.php/API_Public/specimen?ids=DSPYC066-11

```xml
<specimen_imagery>
    <media>
        <mediaID>1441700</mediaID>
        <caption />
        <metatags />
        <copyright>CreativeCommons - Attribution Non-Commercial Share-Alike</copyright>
        <image_file>http://v3.boldsystems.org/pics/DSPYC/_DSC1995+1318537548.JPG</image_file>
    </media>
    <media>
        <mediaID>1441704</mediaID>
        <caption />
        <metatags />
        <copyright>CreativeCommons - Attribution Non-Commercial Share-Alike</copyright>
        <image_file>http://v3.boldsystems.org/pics/DSPYC/_DSC1997+1318537624.JPG</image_file>
    </media>
</specimen_imagery>
```

Hence, the image links are used only to populate the `associatedMedia` field.
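
The gap between the two payloads can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: it collects the element names found under `<media>` in abbreviated copies of the two samples above and reports which ones the API response lacks. The helper name `media_fields` is hypothetical, and the standard-library parser is used here only because the samples are trusted inline strings.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the first <media> element from each sample above.
DOWNLOAD_XML = """
<specimen_imagery>
    <media>
        <mediaID>1441700</mediaID>
        <media_descriptor>Ventral</media_descriptor>
        <copyright>
            <copyright_holder>Chester Sands</copyright_holder>
            <copyright_year>2012</copyright_year>
            <copyright_license>CreativeCommons - Attribution Non-Commercial Share-Alike</copyright_license>
            <copyright_institution>British Antarctic Survey</copyright_institution>
        </copyright>
        <photographer>Jana Dmel</photographer>
        <image_file>http://www.boldsystems.org/pics/DSPYC/_DSC1995+1318537548.JPG</image_file>
    </media>
</specimen_imagery>
"""

API_XML = """
<specimen_imagery>
    <media>
        <mediaID>1441700</mediaID>
        <caption />
        <metatags />
        <copyright>CreativeCommons - Attribution Non-Commercial Share-Alike</copyright>
        <image_file>http://v3.boldsystems.org/pics/DSPYC/_DSC1995+1318537548.JPG</image_file>
    </media>
</specimen_imagery>
"""


def media_fields(xml_text):
    """Return the set of element names nested under any <media> element."""
    root = ET.fromstring(xml_text.strip())
    fields = set()
    for media in root.iter("media"):
        for element in media.iter():
            if element is not media:  # skip the <media> element itself
                fields.add(element.tag)
    return fields


# Fields present in the download file but absent from the API response.
missing = sorted(media_fields(DOWNLOAD_XML) - media_fields(API_XML))
print(missing)
```

For these two samples the missing fields are the descriptor, the photographer and the detailed copyright elements, while `image_file` is present in both.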

## Read occurrence sheet published through Google Sheets

In [67]:
import pandas as pd

occ_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vTei7SwuwPhXU1pYD_QU3nANM5JXjZW6ti-gt2BjVFoq0mqPcrXdVDXXeF2L2etG_gbuQXz53KSfI4H/pub?gid=0&single=true&output=csv"
df = pd.read_csv(occ_url)
df.head()
df.columns

Index(['eventID', 'occurrenceID', 'verbatimKingdom', 'verbatimPhylum',
       'verbatimClass', 'verbatimOrder', 'verbatimFamily ', 'verbatimGenus',
       'Species', 'verbatimScientificName ', 'scientificName ',
       'identificationQuantifier ', 'taxonRank', 'identificationRemarks ',
       'individualCount', 'Identified by', 'identifiedByID',
       'Physical specimen?', 'institutionCode', 'institutionID', 'BOLD',
       'BOLD_ID', 'associatedOccurrences', 'occurrenceRemarks', 'epifauna',
       'eggs_young', 'sex', 'lifeStage', 'Gear', 'samplingGear',
       'cruiseReport', 'samplingProtocol', 'verbatimEventDate', 'Day', 'Month',
       'Year ', 'eventDate', 'verbatimLatitude', 'End_Lat', 'decimalLatitude',
       'verbatimLongitude', 'End_Long', 'decimalLongitude',
       'coordinatePrecision', 'coordinateUncertaintyInMeters', 'footprintWKT',
       'geodeticDatum', 'minimumDepthInMeters', 'maximumDepthInMeters',
       'fieldNumber', 'eventRemarks', 'scientificNameID', 'kingdom',

## Get occurrenceID and BOLD ID

In [68]:
occ_bold = df[['occurrenceID', 'BOLD_ID']]
occ_bold = occ_bold[occ_bold['BOLD_ID'].notna()]

# get unique BOLD ID
bold_ids = occ_bold.BOLD_ID.unique()
bold_ids

array(['NUIG006-20', 'NUIG002-20', 'NUIG013-20', 'NUIG003-20',
       'NUIG001-20', 'NUIG012-20', 'NUIG005-20', 'NUIG008-20',
       'NUIG009-20', 'NUIG010-20', 'NUIG011-20', 'NUIG004-20',
       'NUIG007-20', 'NUIG014-20', 'NUIG017-20', 'NUIG018-20',
       'NUIG019-20', 'NUIG034-20', 'NUIG022-20', 'NUIG023-20',
       'NUIG029-20', 'NUIG030-20', 'NUIG031-20', 'NUIG032-20',
       'NUIG024-20', 'NUIG036-20', 'NUIG042-20', 'NUIG299-20',
       'NUIG055-20', 'NUIG089-20', 'NUIG262-20', 'NUIG263-20',
       'NUIG317-20', 'NUIG266-20', 'NUIG274-20', 'NUIG308-20',
       'NUIG093-20', 'NUIG098-20', 'NUIG222-20', 'NUIG277-20',
       'NUIG284-20', 'NUIG289-20', 'NUIG294-20', 'NUIG298-20',
       'NUIG309-20', 'NUIG314-20', 'NUIG325-20', 'NUIG332-20',
       'NUIG336-20', 'NUIG340-20', 'NUIG048-20', 'NUIG086-20',
       'NUIG090-20', 'NUIG094-20', 'NUIG108-20', 'NUIG119-20',
       'NUIG144-20', 'NUIG159-20', 'NUIG170-20', 'NUIG188-20',
       'NUIG181-20', 'NUIG191-20', 'NUIG205-20', 'NUIG2

## Retrieve links to images

Image links are retrieved using the specimen data retrieval endpoint. The web services are documented at:
https://v3.boldsystems.org/index.php/resources/api?type=webservices#combined
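
As a rough illustration of what a single request looks like, the sketch below builds the URL for one process ID without sending anything over the network. The `ids` and `format` parameters and the base URL are the ones used in this notebook; the helper name `specimen_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://v3.boldsystems.org/index.php/API_Public/specimen?"


def specimen_url(process_id):
    """Build the specimen endpoint URL for one process ID (hypothetical helper)."""
    return BASE_URL + urlencode({"ids": process_id, "format": "xml"})


url = specimen_url("DSPYC066-11")
print(url)
# http://v3.boldsystems.org/index.php/API_Public/specimen?ids=DSPYC066-11&format=xml
```

Passing the same dictionary as `params` to `requests.get` should produce an equivalent query string.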

In [69]:
import requests
import defusedxml.ElementTree as ET

bold_imgs = []
BASE_URL = "http://v3.boldsystems.org/index.php/API_Public/specimen?"

for bold_id in bold_ids:
    # only query IDs that start with DSPY, because the NUIG records are not public yet
    if bold_id.startswith('DSPY'):
        # the timeout keeps one unresponsive request from hanging the whole loop
        response = requests.get(BASE_URL, params={'ids': bold_id, 'format': 'xml'}, timeout=30)
        content = response.text
        if content:
            etree = ET.fromstring(content)
            image_links = etree.findall(".//specimen_imagery/media/image_file")
            # skip empty <image_file/> elements, whose text is None
            image_list = [img.text for img in image_links if img.text]
            images = ' | '.join(image_list)  # join multiple image links with pipes
            bold_imgs.append([bold_id, images])
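
The parsing step can be tried without network access. The following is a minimal sketch under stated assumptions: `image_links_from_xml` is a hypothetical helper, the wrapping `<record>` element in the sample is an assumption about the response layout, and the third `<media>` entry is invented to show how an empty `<image_file />` is handled. It uses the standard-library parser on a trusted inline string; for real responses, `defusedxml` as imported above remains the safer choice.

```python
import xml.etree.ElementTree as ET


def image_links_from_xml(content):
    """Return the image links found in a specimen response, joined with ' | '."""
    if not content or not content.strip():
        return ''
    root = ET.fromstring(content.strip())
    links = [
        element.text.strip()
        for element in root.findall(".//specimen_imagery/media/image_file")
        if element.text and element.text.strip()  # drop empty <image_file /> elements
    ]
    return ' | '.join(links)


# Sample modelled on the API response shown earlier; the <record> wrapper and
# the last <media> element are assumptions made for illustration only.
SAMPLE = """
<record>
    <specimen_imagery>
        <media>
            <mediaID>1441700</mediaID>
            <image_file>http://v3.boldsystems.org/pics/DSPYC/_DSC1995+1318537548.JPG</image_file>
        </media>
        <media>
            <mediaID>1441704</mediaID>
            <image_file>http://v3.boldsystems.org/pics/DSPYC/_DSC1997+1318537624.JPG</image_file>
        </media>
        <media>
            <mediaID>0</mediaID>
            <image_file />
        </media>
    </specimen_imagery>
</record>
"""

joined = image_links_from_xml(SAMPLE)
print(joined)
```

Note that the XPath `.//specimen_imagery/...` only matches descendants of the parsed root, so it finds nothing if `<specimen_imagery>` is itself the root element.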

## Write data into CSV file

In [70]:
import datetime
import csv

file_name = "../data/generated/{}_bold-img.csv".format(datetime.datetime.now().date())
with open(file_name, 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(['BOLD_ID', 'associatedMedia'])
    writer.writerows(bold_imgs)
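
One possible next step is to attach the links to the occurrence records so that each `occurrenceID` receives its `associatedMedia` value. The sketch below is an illustration only, using the standard library: the helper `attach_media`, the sample CSV text and the `occ-1`/`occ-2` identifiers are hypothetical stand-ins for the generated file and the rows of `occ_bold`.

```python
import csv
import io

# Hypothetical stand-ins for the generated CSV and for rows of occ_bold.
MEDIA_CSV = (
    "BOLD_ID,associatedMedia\n"
    "DSPYC066-11,http://v3.boldsystems.org/pics/DSPYC/_DSC1995+1318537548.JPG | "
    "http://v3.boldsystems.org/pics/DSPYC/_DSC1997+1318537624.JPG\n"
)
OCCURRENCES = [
    {"occurrenceID": "occ-1", "BOLD_ID": "DSPYC066-11"},
    {"occurrenceID": "occ-2", "BOLD_ID": "NUIG006-20"},
]


def attach_media(occurrences, media_csv_text):
    """Add an associatedMedia value to each occurrence, matched on BOLD_ID."""
    reader = csv.DictReader(io.StringIO(media_csv_text))
    media_by_id = {row["BOLD_ID"]: row["associatedMedia"] for row in reader}
    # Occurrences without a matching BOLD_ID get an empty string.
    return [
        {**occurrence, "associatedMedia": media_by_id.get(occurrence["BOLD_ID"], "")}
        for occurrence in occurrences
    ]


rows = attach_media(OCCURRENCES, MEDIA_CSV)
for row in rows:
    print(row["occurrenceID"], "->", row["associatedMedia"] or "(no media)")
```

Within the notebook, a left merge of `occ_bold` with a DataFrame of the generated file on `BOLD_ID` would likely achieve the same result.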
