In [59]:
import pandas as pd
import numpy as np
from pyobistools.validation.check_eventids import check_eventids, check_extension_eventids
# Show full column contents (e.g. long validation messages) without truncation
pd.set_option('display.max_colwidth', None)
NaN = np.nan

### Info about this notebook series

This series of notebooks is an educational tool for learning how to use Pyobistools, a biodiversity data validation package:
+ https://github.com/cioos-siooc/pyobistools

Command to install Pyobistools (currently not hosted on PyPI):
+ pip install git+https://github.com/cioos-siooc/pyobistools@main#egg=pyobistools

Darwin Core documentation: 
+ https://dwc.tdwg.org/

Required fields for the Darwin Core file types:
+ https://ioos.github.io/bio_mobilization_workshop/01-introduction/index.html
+ https://ioos.github.io/bio_mobilization_workshop/04-create-schema/index.html

### Notebook to test Pyobistools' functions 'check_eventids' and 'check_extension_eventids'

##### Function 'check_eventids' description
The 'check_eventids' function takes a single Darwin Core file of type 'event_core' or 'occurrence_core' and reports:
+ whether the fields 'eventID' and 'parentEventID' are absent
+ duplicate values in 'eventID'
+ 'parentEventID' values that have no corresponding 'eventID' in the same file
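
The three checks listed above can be approximated in plain pandas. The sketch below is not the Pyobistools implementation: the helper name `sketch_check_eventids`, its message strings, and the demo data are hypothetical, and it assumes the columns are spelled exactly `eventID` and `parentEventID`.

```python
import pandas as pd


def sketch_check_eventids(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mimics the three checks with plain pandas."""
    rows = []

    # 1. Report required fields that are absent.
    for field in ("eventID", "parentEventID"):
        if field not in data.columns:
            rows.append({"field": field, "row": None,
                         "message": f"Field {field} is missing"})

    if "eventID" in data.columns:
        # 2. Flag every row whose eventID occurs more than once.
        ids = data["eventID"]
        duplicated = ids.duplicated(keep=False) & ids.notna()
        for idx in data.index[duplicated]:
            rows.append({"field": "eventID", "row": idx,
                         "message": f"eventID {ids.at[idx]} is duplicated"})

        if "parentEventID" in data.columns:
            # 3. Flag non-null parentEventIDs with no matching eventID.
            parents = data["parentEventID"]
            orphan = parents.notna() & ~parents.isin(ids)
            for idx in data.index[orphan]:
                rows.append({"field": "parentEventID", "row": idx,
                             "message": f"parentEventID {parents.at[idx]} "
                                        "has no corresponding eventID"})

    return pd.DataFrame(rows, columns=["field", "row", "message"])


# Made-up data: 'station1' is duplicated and 'cruise9' has no matching eventID.
demo = pd.DataFrame({
    "eventID": ["cruise1", "station1", "station1", "station2"],
    "parentEventID": [None, "cruise1", "cruise1", "cruise9"],
})
sketch_report = sketch_check_eventids(demo)
print(sketch_report)
```

Returning one row per problem, with the offending row index, mirrors the shape of the reports shown later in this notebook and makes the result easy to filter.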

##### Function 'check_eventids' arguments
+ data: DataFrame of the data to evaluate


##### Function 'check_extension_eventids' description
The 'check_extension_eventids' function takes two Darwin Core files at a time: one must be an 'event_core' file, and the other must be either an 'occurrence_extension' or an 'extended_measurement_or_fact_extension' file (event or occurrence in the latter case). The function reports every eventID in the extension file that has no corresponding eventID in the core file.

##### Function 'check_extension_eventids' arguments
+ event: DataFrame of the 'event_core' file to evaluate
+ extension: DataFrame of the extension file to evaluate ('occurrence_extension' or 'extended_measurement_or_fact_extension')
+ field (default = 'eventID'): name of the 'eventID' field in the extension file
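
To see what matching an extension against a core means, the comparison can be sketched with `Series.isin` in plain pandas. This is an illustration under assumptions, not the package's code: the two small DataFrames and their values are invented, and `field` simply stands in for the argument described above.

```python
import pandas as pd

# Invented core and extension tables for illustration only.
event = pd.DataFrame({"eventID": ["ev-001", "ev-002", "ev-003"]})
extension = pd.DataFrame({
    "eventID": ["ev-001", "ev-004", "ev-004"],
    "measurementType": ["Habitat type", "Habitat length", "Habitat width"],
})

field = "eventID"  # name of the eventID column in the extension

# Keep the extension rows whose eventID is absent from the core.
unmatched = extension[~extension[field].isin(event["eventID"])]
print(unmatched)
```

Every row left in `unmatched` corresponds to one line that a check of this kind would be expected to report.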

##### Relation to obistools package in R:
These functions are the Python equivalents of check_eventids() and check_extension_eventids() in the R package obistools.
See the documentation: https://github.com/iobis/obistools#check-eventid-and-parenteventid

Load different types of Darwin Core (DwC) files:

In [60]:
event_core = pd.read_csv('test_event_core_check_eventids.csv')
event_core.head(3)


Unnamed: 0,eventID,eventDate,waterBody,cardinal_direction,decimalLatitude,decimalLongitude,distance_ocean_m,locality,site_number,station_number,...,p1_starting_time,p1_ending_time,p2_starting_volts,p2_ending_volts,p2_starting_time,p2_ending_time,p3_starting_volts,p3_ending_volts,p3_starting_time,p3_ending_time
0,ABCD-EF-1-1-2020-001,2020-07-07,Plancentia Bay,east,472713249,-538442982,200,North-east Placentia,1,1,...,945,1120,425,445,1210,1305,,,,
1,ABCD-EF-1-3-2020-002,2020-08-05,Plancentia Bay,east,4716323,-5350182,970,North-east Placentia,1,3,...,800,920,545,545,950,1125,,,,
2,ABCD-EF-1-2-2020-003,2020-07-10,Plancentia Bay,east,472732485,-538394947,650,North-east Placentia,1,2,...,925,1045,525,525,1055,1200,,,,


In [61]:
emof1 = pd.read_csv('test_emof1_check_eventids.csv')
emof1.head(3)

Unnamed: 0,eventID,site_number,station_number,habitat_number,habitatID,measurementID,measurementType,measurementValue,measurementUnit
0,ABCD-10-1-2020-022,10,1,1,habitat-10-1-1,ABCD-2020-habitat-10-1-1-01,Habitat type,riffle,
1,ABCD-10-1-2020-022,10,1,1,habitat-10-1-1,ABCD-2020-habitat-10-1-1-02,Habitat length,481,m
2,ABCD-10-1-2020-022,10,1,1,habitat-10-1-1,ABCD-2020-habitat-10-1-1-03,Habitat width,232,m


Try the check_eventids function:

In [62]:
check_eventids(event_core)

Unnamed: 0,field,level,row,message
0,parenteventid,error,,Field parenteventid is missing


Try the check_extension_eventids function:

In [63]:
check_extension_eventids(event_core, emof1, field='eventID').head()

Unnamed: 0,field,level,row,message
0,eventid,error,0,Field ABCD-10-1-2020-022 has no corresponding eventID in the core
1,eventid,error,1,Field ABCD-10-1-2020-022 has no corresponding eventID in the core
2,eventid,error,2,Field ABCD-10-1-2020-022 has no corresponding eventID in the core
3,eventid,error,3,Field ABCD-10-1-2020-022 has no corresponding eventID in the core
4,eventid,error,4,Field ABCD-10-1-2020-022 has no corresponding eventID in the core
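
Because the report contains one row per offending extension row, the same message can repeat many times. One possible way to condense it is to group by message, sketched below. The `report` DataFrame here is a stand-in rebuilt by hand from the five rows displayed above, not the full function output, and the `n_rows` column name is an arbitrary choice.

```python
import pandas as pd

# Stand-in for the report: the five rows displayed above, typed in by hand.
message = "Field ABCD-10-1-2020-022 has no corresponding eventID in the core"
report = pd.DataFrame({
    "field": ["eventid"] * 5,
    "level": ["error"] * 5,
    "row": [0, 1, 2, 3, 4],
    "message": [message] * 5,
})

# Count how many rows share each (level, message) pair.
summary = (
    report.groupby(["level", "message"])
    .size()
    .reset_index(name="n_rows")
)
print(summary)
```

On a real report, the same grouping shows at a glance how many distinct eventIDs are unmatched and how many extension rows each one affects.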
