## Download & Analyze Darwin Core Archive(s)
Originally based on a script by Matt Biddle:
https://github.com/MathewBiddle/sandbox/blob/main/notebooks/query_obis_api.ipynb

Use this notebook to analyze and visualize your DwC archive.
Your DwC archive must be hosted on an IPT server and indexed by OBIS, because the notebook looks it up through the OBIS API.

In [60]:
# Import the libraries used throughout the notebook.
import requests
import json
import pandas as pd
import urllib.request  # urllib.request.urlretrieve is used for the download step

# Convenience function to pretty print JSON objects
def print_json(myjson):
    print(json.dumps(
        myjson,
        sort_keys=True,
        indent=4,
        separators=(',', ': ')
    ))

# Initialize the base URL for OBIS. This variable will be used for every API call
OBIS_URL = "https://api.obis.org/v3"

# The "name" of the IPT node we want to inspect 
MY_NODE_NAME = 'OBIS Canada'

# The OBIS ID assigned to the dataset we want to inspect
# You can get this from the OBIS URL of your dataset: https://obis.org/dataset/{OBIS dataset ID here} 
DATASET_ID = '24e96d02-8909-4431-bc61-8cf8eadc9b7a'

In [63]:
# === get the DwC archive
# figure out the node ID
req = requests.get(f'{OBIS_URL}/node')
nodes_json = req.json()
df_nodes = pd.DataFrame(nodes_json['results'])
nodeID = df_nodes.loc[df_nodes['name']==MY_NODE_NAME,'id'].tolist()[0]

# get metadata from all datasets @ MY_NODE_NAME
req = requests.get(f'{OBIS_URL}/dataset?nodeid={nodeID}')
datasets = req.json()['results']

# get the dataset we care about:
for ds in datasets:
    # print(ds['id'])
    if ds['id'] == DATASET_ID:
        dataset = ds
        print(f"found dataset {dataset['title']} \n\tw/ ID = {dataset['id']}")
        break
else:
    raise ValueError(f"No dataset found with ID [{DATASET_ID}]")


found dataset Trawl Catch and Species Abundance from the 2019 Gulf of Alaska International Year of the Salmon Expedition 
	w/ ID = 24e96d02-8909-4431-bc61-8cf8eadc9b7a
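
The node lookup above takes the first match with `.tolist()[0]`, which fails with a bare `IndexError` if `MY_NODE_NAME` is misspelled. A minimal sketch of a more defensive lookup follows. `find_node_id` is a hypothetical helper, and the sample records are made up; they only mirror the `id` and `name` keys that the cell above reads from `nodes_json['results']`.

```python
def find_node_id(nodes, node_name):
    """Return the id of the first node whose name equals node_name.

    `nodes` is a list of dicts with at least 'id' and 'name' keys.
    """
    matches = [n['id'] for n in nodes if n.get('name') == node_name]
    if not matches:
        known = sorted(n.get('name', '') for n in nodes)
        raise ValueError(f"No node named {node_name!r}; known names: {known}")
    return matches[0]

# Made-up records for illustration only; real IDs come from the OBIS API.
sample_nodes = [
    {'id': 'node-a', 'name': 'OBIS Canada'},
    {'id': 'node-b', 'name': 'OBIS USA'},
]
print(find_node_id(sample_nodes, 'OBIS Canada'))
```

In the notebook this could be called as `find_node_id(nodes_json['results'], MY_NODE_NAME)`, so a wrong node name produces an error that lists the available names.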


In [64]:
# grab metadata about the dataset

from bs4 import BeautifulSoup

columns = ['title','url','size_raw','size_MB']

df = pd.DataFrame(
        columns=columns
    )

print(dataset['title'])
print(dataset['url'])
html_text = requests.get(dataset['url']).text
soup = BeautifulSoup(html_text, 'html.parser')

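# the archive size is the parenthesized "<number> <unit>" text in the first table cell
size_raw = soup.find('td').text.split('(')[1].split(')')[0]
size = float(size_raw.split(" ")[0].replace(",",""))
size_unit = size_raw.split(" ")[1]

# convert sizes to MB (decimal units: 1 KB = 0.001 MB)
if size_unit == 'KB':
    size = size*0.001
elif size_unit != 'MB':
    # MB values pass through unchanged; any other unit would be mislabeled as MB
    raise ValueError(f"Unexpected size unit: {size_unit!r}")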

df_init = pd.DataFrame(
            {"title": dataset['title'],
             "url": dataset['url'],
             "size_raw": size_raw,
             "size_MB": size,
             },
          index=[1])

df = pd.concat([df, df_init], ignore_index=True)

Trawl Catch and Species Abundance from the 2019 Gulf of Alaska International Year of the Salmon Expedition
http://ipt.iobis.org/obiscanada/resource?r=trawl-catch-and-species-abundance-from-the-2019-gulf-of-alaska-international-year-of-the-salmon-expedition
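
The size handling in the cell above can also be written as one small function. This is a sketch, not the notebook's own method: `parse_size_mb` is a hypothetical helper, the `B` and `GB` multipliers are assumptions (the cell above only handles KB and MB), and the decimal factor follows the 0.001 KB-to-MB conversion used above.

```python
import re

# Decimal multipliers to megabytes. 'B' and 'GB' are assumed units that the
# notebook cell does not handle; 'KB' matches the 0.001 factor used there.
_TO_MB = {'B': 1e-6, 'KB': 1e-3, 'MB': 1.0, 'GB': 1e3}

def parse_size_mb(size_raw):
    """Convert a string such as '1,500 KB' to a float number of megabytes."""
    match = re.fullmatch(r'\s*([\d.,]+)\s*([A-Za-z]+)\s*', size_raw)
    if match is None:
        raise ValueError(f"Unrecognized size string: {size_raw!r}")
    number = float(match.group(1).replace(',', ''))  # drop thousands separators
    unit = match.group(2).upper()
    if unit not in _TO_MB:
        raise ValueError(f"Unexpected size unit: {unit!r}")
    return number * _TO_MB[unit]

print(parse_size_mb('1,500 KB'))
```

Raising on an unknown unit keeps a value from being stored in `size_MB` without conversion.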


In [74]:
## Download the Darwin Core Archive package
# Download the [DwC-A](https://github.com/gbif/ipt/wiki/DwCAHowToGuide#what-is-darwin-core-archive-dwc-a) zip package.
#
# To do that:
# 1. Collect the DwC-A zip URL by parsing the **IPT** dataset HTML page, looking for the **Data as a DwC-A file** `download` link.
# 1. Download the zip package to the file `OBIS_data/{dataset short name}_v{version}.zip`.

import os

LOCAL_OBIS_DATA_PATH = "./OBIS_data"

# create the data folder if it doesn't already exist
os.makedirs(LOCAL_OBIS_DATA_PATH, exist_ok=True)

url = df['url'][0]
# print(f"DwC Archive URL: {url}")
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
# the first table cell on the IPT resource page holds the DwC-A download link
download_cell = soup.find('td')

zip_download = download_cell.find('a').get('href')

# the link has the form .../archive.do?r={dataset short name}&v={version}
vers = zip_download.split("=")[-1]
name = zip_download.split("=")[-2].replace("&v","")

fname = LOCAL_OBIS_DATA_PATH + '/' + name + '_v' + vers + '.zip'

print('Downloading ' + zip_download)
print('Downloading to ' + fname)
urllib.request.urlretrieve(zip_download, fname)
print('Complete.')

Downloading http://ipt.iobis.org/obiscanada/archive.do?r=trawl-catch-and-species-abundance-from-the-2019-gulf-of-alaska-international-year-of-the-salmon-expedition&v=1.1
Downloading to ./OBIS_data/trawl-catch-and-species-abundance-from-the-2019-gulf-of-alaska-international-year-of-the-salmon-expedition_v1.1.zip
Complete.
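
Splitting the download link on `=` works for the URL shown above, but it depends on the order of the query parameters. A sketch of the same extraction with the standard library's `urllib.parse` is shown below; `archive_name_and_version` is a hypothetical helper name, and the URL is the one printed in the output above.

```python
from urllib.parse import urlparse, parse_qs

def archive_name_and_version(archive_url):
    """Return (short name, version) from an IPT archive.do download URL."""
    query = parse_qs(urlparse(archive_url).query)
    # 'r' is the resource short name and 'v' the version, as in the URL above
    return query['r'][0], query['v'][0]

zip_download = (
    'http://ipt.iobis.org/obiscanada/archive.do'
    '?r=trawl-catch-and-species-abundance-from-the-2019-gulf-of-alaska-'
    'international-year-of-the-salmon-expedition&v=1.1'
)
name, vers = archive_name_and_version(zip_download)
print(name)
print(vers)
```

A link without an `r` or `v` parameter raises a `KeyError` here instead of producing a malformed file name.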


Use the [python-dwca-reader package](https://python-dwca-reader.readthedocs.io/en/latest/index.html) to print some metadata about the DwC-A package.

In [76]:
from dwca.read import DwCAReader

with DwCAReader(fname) as dwca:
    print(dwca.archive_path)
    root = dwca.metadata
    node = root.find('.//westBoundingCoordinate')
    print('%s: %s' % (node.tag, node.text))

./OBIS_data/trawl-catch-and-species-abundance-from-the-2019-gulf-of-alaska-international-year-of-the-salmon-expedition_v1.1.zip
westBoundingCoordinate: -147.527
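
The cell above reads one coordinate and would fail with an `AttributeError` if the tag were missing. A sketch that collects all four bounding coordinates and tolerates missing ones follows. It assumes the metadata root behaves like an `xml.etree.ElementTree.Element`, as the `root.find(...)` call above suggests. `bounding_box` is a hypothetical helper, and the XML fragment is hand-written for illustration; only the west value is taken from the output above.

```python
import xml.etree.ElementTree as ET

def bounding_box(root):
    """Return the four bounding coordinates as floats, or None where absent."""
    box = {}
    for side in ('west', 'east', 'north', 'south'):
        node = root.find(f'.//{side}BoundingCoordinate')
        box[side] = float(node.text) if node is not None and node.text else None
    return box

# Hand-written fragment; a real EML document has many more elements.
sample_root = ET.fromstring(
    '<eml><dataset><coverage><geographicCoverage><boundingCoordinates>'
    '<westBoundingCoordinate>-147.527</westBoundingCoordinate>'
    '</boundingCoordinates></geographicCoverage></coverage></dataset></eml>'
)
print(bounding_box(sample_root))
```

Inside the `with DwCAReader(fname) as dwca:` block, the call would be `bounding_box(dwca.metadata)`.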


Now let's ingest all of the downloaded data automatically:
1. For each zip package
   1. Read the core file into a Pandas DataFrame.
   1. Concatenate all the core data into one large data frame.
   1. Print out some useful information as each package is processed.

In [79]:
from dwca.read import DwCAReader
from dwca.darwincore.utils import qualname as qn
import pandas as pd
import os

core_df = pd.DataFrame()
# occurrence only = OBIS_data/wod_2009.zip
# event = OBIS_data/ambon_cetaceans_2015.zip
for obis_zip in os.listdir(LOCAL_OBIS_DATA_PATH):
    # skip the 'unzipped' entry; every other entry is read as a DwC-A package
    if obis_zip != 'unzipped':
        with DwCAReader(LOCAL_OBIS_DATA_PATH + '/'+obis_zip) as dwca:
            #eml = dwca.metadata
            print("\nReading: %s" % dwca.archive_path)
            print("Core type is: %s" % dwca.descriptor.core.type)
            print("Core data file is: %s" % dwca.descriptor.core.file_location)
            for ex in dwca.descriptor.extensions:
                print('Extensions: ',ex.type)
                
            df_init = dwca.pd_read(dwca.core_file_location, parse_dates = True)
            df_init['zip_name'] = obis_zip
            
            core_df = pd.concat(
                [core_df, df_init], 
                axis = 0, 
                ignore_index = True)


Reading: ./OBIS_data/trawl-catch-and-species-abundance-from-the-2019-gulf-of-alaska-international-year-of-the-salmon-expedition_v1.1.zip
Core type is: http://rs.tdwg.org/dwc/terms/Event
Core data file is: event.txt
Extensions:  http://rs.tdwg.org/dwc/terms/ResourceRelationship
Extensions:  http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact
Extensions:  http://rs.tdwg.org/dwc/terms/Occurrence
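
The loop above grows `core_df` with one `pd.concat` call per archive. An alternative sketch collects the per-archive frames first and concatenates once, which avoids repeated copying when there are many archives. `combine_core_frames` is a hypothetical helper, and the tiny frames below are made-up stand-ins for what `dwca.pd_read(...)` returns.

```python
import pandas as pd

def combine_core_frames(frames_by_zip):
    """Tag each frame with its zip name and concatenate them in one call."""
    frames = []
    for zip_name, frame in frames_by_zip.items():
        tagged = frame.copy()
        tagged['zip_name'] = zip_name
        frames.append(tagged)
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame()
    return pd.concat(frames, axis=0, ignore_index=True)

# Made-up core tables for illustration only.
demo = combine_core_frames({
    'a.zip': pd.DataFrame({'id': ['e1', 'e2']}),
    'b.zip': pd.DataFrame({'id': ['e3']}),
})
print(demo.shape)  # (3, 2)
```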


In [158]:
# === print some basic info about the package
# NOTE: `dwca` here is the last archive opened in the loop above
print(f"filepath {dwca.archive_path}")

root = dwca.metadata

print("roles:")
for child in root.findall('.//role'):
    print("\t", child.tag, ":", child.text)
    
print("\n(# rows, # cols):")
print(core_df.shape)

print("\ncolumns in core file: ")
print(core_df.columns.to_list())

filepath ./OBIS_data/trawl-catch-and-species-abundance-from-the-2019-gulf-of-alaska-international-year-of-the-salmon-expedition_v1.1.zip
roles:
	 role : distributor
	 role : distributor
	 role : publisher

(# rows, # cols):
(2613, 18)

columns in core file: 
['id', 'type', 'eventID', 'parentEventID', 'samplingProtocol', 'eventDate', 'year', 'month', 'day', 'locationID', 'locality', 'minimumDepthInMeters', 'maximumDepthInMeters', 'decimalLatitude', 'decimalLongitude', 'coordinateUncertaintyInMeters', 'footprintWKT', 'zip_name']


In [160]:
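# === check eventDate column
# NOTE: pd.to_datetime cannot parse ISO 8601 intervals ("start/end"), so interval
# values are reported as invalid below even though Darwin Core allows them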
datetimes_df = pd.to_datetime(core_df['eventDate'], errors='coerce')

print(f"  valid `eventDate`s: {datetimes_df.count()}")
print(f"invalid `eventDate`s: {len(datetimes_df) - datetimes_df.count()}")

print(f"\ndatetime range: {datetimes_df.min()} / {datetimes_df.max()}")

print("\ninvalid eventDate values:")
invalid_values = core_df['eventDate'][datetimes_df.isna()].unique()
print(invalid_values)

if len(invalid_values) > 0:
    raise ValueError("WARN: invalid `eventDate` values! `eventDate` should be an ISO8601-formatted datetime. Please fix this and reupload to IPT.")

  valid `eventDate`s: 0
invalid `eventDate`s: 2613

datetime range: NaT / NaT

invalid eventDate values:
[nan '2019-02-19T00:02:00Z/2019-02-19T01:02:00Z'
 '2019-02-19T10:00:00Z/2019-02-19T11:00:00Z'
 '2019-02-21T11:21:00Z/2019-02-21T12:21:00Z'
 '2019-02-21T22:19:00Z/2019-02-21T23:19:00Z'
 '2019-02-22T09:49:00Z/2019-02-22T10:49:00Z'
 '2019-02-22T20:55:00Z/2019-02-22T21:55:00Z'
 '2019-02-23T06:08:00Z/2019-02-23T07:08:00Z'
 '2019-02-23T14:38:00Z/2019-02-23T15:38:00Z'
 '2019-02-23T23:12:00Z/2019-02-24T00:12:00Z'
 '2019-02-24T07:40:00Z/2019-02-24T08:40:00Z'
 '2019-02-24T16:48:00Z/2019-02-24T17:48:00Z'
 '2019-02-25T01:48:00Z/2019-02-25T02:48:00Z'
 '2019-02-25T11:05:00Z/2019-02-25T12:05:00Z'
 '2019-02-25T19:55:00Z/2019-02-25T20:55:00Z'
 '2019-02-26T04:07:00Z/2019-02-26T05:07:00Z'
 '2019-02-26T12:15:00Z/2019-02-26T13:15:00Z'
 '2019-02-26T20:13:00Z/2019-02-26T21:13:00Z'
 '2019-02-27T05:10:00Z/2019-02-27T06:10:00Z'
 '2019-02-27T13:02:00Z/2019-02-27T14:02:00Z'
 '2019-02-27T21:08:00Z/2019-02-27T22

ValueError: WARN: invalid `eventDate` values! `eventDate` should be an ISO8601-formatted datetime. Please fix this and reupload to IPT.
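
Every non-empty value listed above is an ISO 8601 interval of the form `start/end`, which Darwin Core permits for `eventDate`; the check fails only because `pd.to_datetime` does not parse intervals. A sketch of an interval-aware check follows. `parse_event_date` is a hypothetical helper, the first two sample strings are copied from the output above, and `'not a date'` is a made-up bad value. The sketch assumes the start and end are each complete datetimes and does not handle abbreviated interval ends.

```python
import pandas as pd

def parse_event_date(value):
    """Return (start, end) timestamps; end is NaT for a single datetime."""
    if not isinstance(value, str) or not value.strip():
        return pd.NaT, pd.NaT  # missing value
    start_text, _, end_text = value.partition('/')
    start = pd.to_datetime(start_text, errors='coerce', utc=True)
    end = pd.to_datetime(end_text, errors='coerce', utc=True) if end_text else pd.NaT
    return start, end

sample = pd.Series([
    float('nan'),
    '2019-02-19T00:02:00Z/2019-02-19T01:02:00Z',
    '2019-02-19T10:00:00Z/2019-02-19T11:00:00Z',
    'not a date',
])
parsed = pd.DataFrame(
    [parse_event_date(v) for v in sample],
    columns=['start', 'end'],
)
print(parsed)
print("unparseable starts:", int(parsed['start'].isna().sum()))
```

With this approach, rows whose `start` is still `NaT` are either missing or malformed, so those are the values worth reporting back to the data provider.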