## 5.4 Tyre-kicking

### What we doing here?

'The modernisation of key services, especially of services for data currently underdeveloped in the DiSSCo community, is of great importance for the overall improvement of the technical readiness level from the DiSSCo RI.

This task is focusing on construction plans for the improvement of technical infrastructure in the identified key areas of geo-collection data and taxonomic services. Geo-collection data are highly underrepresented in terms of available services for data mobilisation and publication. Thus, these services need special consideration in order to significantly increase DiSSCo technical readiness in the earth scientific domain. In addition, the harmonisation of life science taxonomic checklist services (e.g. Catalogue of Life) needs construction plans for the integration into DiSSCo architecture in order to exploit their full value as a taxonomic backbone.

This task will focus on the investigation of existing tools and provide necessary plans for seamless integration into the overall DiSSCo technical architecture.'  

[WP 5.4 overview + planning notes](https://docs.google.com/spreadsheets/d/1W9OmZ9qD6GNjmOtTT58F94-u9jrsK_gAK8MK_fnSJVY/edit#gid=1897609990)  
[WP 5.4 meeting notes](https://docs.google.com/document/d/1oplZIkp4jQ6txwElMaU_u23D9SEQmxKoIeMFNqwyZNA/edit)  

#### Data stores of interest
[Geocase](https://geocase.eu/)  
[Catalogue of Life](https://www.catalogueoflife.org/)  
[NSIDR](https://nsidr.org/#) 

#### Related packages:
[WP 6.1 notes](https://docs.google.com/document/d/176jxNxE_MkPgXn70XCx5eKtKqGCBQYTvA8D9Lh_qyEk/edit#heading=h.1cc0x84qr23r)  
[WP 6.1 event-storming notes](https://miro.com/app/board/o9J_lVwIZ9Q=/)  

#### Background links:
*Event sourcing*
* https://eventuate.io/
* https://www.eventstore.com/eventstoredb
* https://microservices.io/patterns/data/event-sourcing.html
* https://microservices.io/patterns/data/cqrs.html
* https://docs.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
* https://www.confluent.co.uk/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/

*Event stream processing*
* https://en.wikipedia.org/wiki/Event_stream_processing

### What _you_ doing here, Sarah?

Best if you don't think about that too deeply tbh. But, since you're asking, today's side-quest is to do a first-pass of a few different data services (listed above) to see what they have in terms of the following:
1. Documentation
2. API endpoints
3. Any sort of event-streaming functionality (versioning, cause-of-change data, ...?)

... so this definitely doesn't need to be in a notebook, but HEY you're here now so might as well. Don't get distracted, Sarah, just do the thing.

In [2]:
import requests
from collections import Counter

___________
### 1. Geocase
____________

1.6m records at time of testing, from eight sources. Although it looks like they're pulling a bunch of the data from an older version of GeoCase? Something to ask about, ref this example [bio/geocase xml record](https://bc.geocollections.info/pywrapper.cgi?dsa=sarv) pulled from a `dataseturl` field

[Web app](https://geocase.geocollections.info/)  


#### Documentation

[API docs](https://api.geocase.eu/api-docs/) - Documented according to the OpenAPI spec, using Swagger. Brief, but their API uses default Solr syntax/operations, so more detail can be found in the [Solr docs](https://lucene.apache.org/solr/guide/8_6/searching.html). 


#### Endpoints

`/solr` is the main search endpoint. It 'returns specimens using default Apache Solr response' so you can do Fairly Snazzy Searches (ranges, exists/does not exist, etc)  
  
`/repeat` is mystifying. Looks like it queries the original service? From the docs, the input param is the original url from the source service, so I would've assumed you used it in conjunction with fields returned by `/solr` endpoint to stitch together a call to the other service. 
  
`/mock` returns dummy data - something used for dev/testing, I suppose?  
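For reference, a rough sketch of what those Fairly Snazzy Searches might look like as `fq` filter strings for `/solr`. The field names come from the sample record further down; the helper `gc_solr_params` and the filter values are made up for illustration, and passing several `fq` values at once assumes the Geocase wrapper forwards them to Solr unchanged (not checked).

```python
# Sketch only: builds a params dict for requests.get(); nothing is sent here.
# gc_solr_params is a hypothetical helper, not part of the Geocase API.
def gc_solr_params(filters, rows=10, fields=None):
    """Build query params with one or more Solr filter queries (fq)."""
    params = {'q': '*', 'fq': list(filters), 'rows': rows}
    if fields:
        # fl limits the fields returned for each record
        params['fl'] = ','.join(fields)
    return params

example_filters = [
    'latitude:[55 TO 60]',   # range: latitude between 55 and 60 inclusive
    'collectorname:*',       # exists: collector name is populated
    '-stratigraphy:*',       # does not exist: no stratigraphy value
]

example_params = gc_solr_params(example_filters, rows=5,
                                fields=['id', 'unitid', 'latitude'])

# To actually run it (assumes the same base URL as the cell below):
# requests.get('https://api.geocase.eu/v1/solr', params=example_params)
```

`requests` encodes a list value as repeated `fq=` parameters, which is how plain Solr expects multiple filters.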

In [3]:
gc_url_base = 'https://api.geocase.eu/v1'
gc_search = '/solr'

# query below brings back first 1k records that aren't from NHMUK, GIT or MfN
gc_params = {
    'q': '*', 
    'fq': 'datasetownerabbrev:(!UK-NHM !GIT !MfN)', # exclude results from these three institutes
    'rows': 1000
    }

# fl parameter limits the fields returned to just interesting ones - eg below
# 'fl': 'id,last_harvested_processing,datasetownerabbrev,datasourceurl,providerurl,occurrenceid,unitid',

gc_r = requests.get(gc_url_base + gc_search, params=gc_params)

In [4]:
gc_response = gc_r.json()
gc_records = gc_response['response']['docs']

# little peep at an example record
gc_records[:1]

[{'country_original': 'Estonia',
  'country': 'Estonia',
  'collectioncode': 'TAM',
  'is_mineral': True,
  'datasetownerabbrev': 'TAM',
  'stratigraphies': ['Llandovery', 'Raikküla Stage'],
  'last_harvested_processing': '2021-03-25T10:58:56Z',
  'id': '756664',
  'longitude': 25.743369,
  'recordbasis': 'Rock',
  'datasourcecountry': 'Estonia',
  'locality': 'Koigi old quarry, Järvamaa',
  'occurrenceid': 756664,
  'provideraddress': 'Ehitajate tee 5, 12616 Tallinn',
  'has_map': True,
  'cetaf_identifier': 'ee-nhm',
  'datasourceurl': 'https://bc.geocollections.info/pywrapper.cgi?dsa=sarv',
  'datasetowner': 'Estonian Museum of Natural History',
  'fullscientificname': 'sedimentary rocks',
  'latitude': 58.850797,
  'providercountry': 'Estonia',
  'stratigraphy': 'Raikküla Stage',
  'providerurl': 'https://geocollections.info/',
  'has_image': False,
  'recordURI': 'https://geocollections.info/specimen/189164',
  'coordinates': '58.850797,25.743369',
  'collectorname': 'Klaamann',
 

In [5]:
# use the first 1k records to get a list of record fields and data types, for funsies/so we don't miss any 
gc_fields = Counter([(k, type(v)) for x in gc_records for k, v in x.items()])

# generate a list of fields, data types and frequency in our sample
gc_fields.most_common()

[(('collectioncode', str), 1000),
 (('datasetownerabbrev', str), 1000),
 (('last_harvested_processing', str), 1000),
 (('id', str), 1000),
 (('recordbasis', str), 1000),
 (('datasourcecountry', str), 1000),
 (('occurrenceid', int), 1000),
 (('provideraddress', str), 1000),
 (('has_map', bool), 1000),
 (('cetaf_identifier', str), 1000),
 (('datasourceurl', str), 1000),
 (('datasetowner', str), 1000),
 (('providercountry', str), 1000),
 (('providerurl', str), 1000),
 (('has_image', bool), 1000),
 (('recordURI', str), 1000),
 (('providername', str), 1000),
 (('unitid', str), 1000),
 (('_version_', int), 1000),
 (('is_mineral', bool), 993),
 (('fullscientificname', str), 993),
 (('names', list), 993),
 (('country_original', str), 785),
 (('country', str), 785),
 (('locality', str), 785),
 (('stratigraphies', list), 710),
 (('longitude', float), 589),
 (('latitude', float), 589),
 (('coordinates', str), 589),
 (('collectorname', str), 528),
 (('stratigraphy', str), 450),
 (('highertaxon', s

### Geocase notes

##### Event-related things

- The only event-like field I can see in the `/solr` endpoint is `last_harvested_processing` (format like "2021-03-25T13:36:50Z").  
- Could query daily on this to get 'updated' records, but it's hard to tell from this angle if it reflects a harvest event (of whatever the source serves up) or if there's actually record/field level updates going on behind the scenes with geocase.  
- There's a `_version_` field for each record, which is handy, but it doesn't translate into a UT timestamp - needs more investigation.
- No additional audit/field-specific change data or event-adjacent endpoints, as far as I can tell.
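
A minimal sketch of the 'query daily' idea, assuming `last_harvested_processing` is indexed so that Solr range queries work on it (the ISO format suggests so, but this is untested). The helper name `harvested_since_filter` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def harvested_since_filter(since):
    """Return a Solr fq string for records harvested at or after `since`.

    `since` should be a timezone-aware datetime; it is converted to UTC.
    """
    stamp = since.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    return f'last_harvested_processing:[{stamp} TO *]'

# Everything (re)harvested in the last 24 hours
yesterday = datetime.now(timezone.utc) - timedelta(days=1)
gc_updated_params = {
    'q': '*',
    'fq': harvested_since_filter(yesterday),
    'rows': 1000,
}

# requests.get('https://api.geocase.eu/v1/solr', params=gc_updated_params)
```

This would only flag that a record was re-harvested, not what (if anything) changed in it.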

##### IDs

`id` and `occurrenceid` hold the same value, which appears to be a GC internal id?
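
A quick way to check that claim across a sample rather than eyeballing it. This is just a sketch: `ids_match` and `mismatched_ids` are hypothetical helpers, and the comparison casts `occurrenceid` to a string because the sample shows `id` as `str` and `occurrenceid` as `int`.

```python
def ids_match(record):
    """True if `id` (str) and `occurrenceid` (int) carry the same value.

    Records missing either field count as a mismatch.
    """
    return str(record.get('occurrenceid')) == record.get('id')

def mismatched_ids(records):
    """Return the records where the two ids differ or one is missing."""
    return [r for r in records if not ids_match(r)]

# Values taken from the example record above, plus one deliberately bad record
sample = [
    {'id': '756664', 'occurrenceid': 756664},
    {'id': '123', 'occurrenceid': 456},
]
bad = mismatched_ids(sample)
```

Running `mismatched_ids(gc_records)` on the 1k sample would show whether the two fields ever diverge.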

`unitid` is a registration no. from the original site. In NHM's case, this is an older-style accession number rather than a GUID. If it were something more persistent, maybe you could use it in conjunction with `datasourceurl` (domain of the source catalogue/API) and `/repeat` to get to the source data, although that's probably not the point...

IDs for other systems:

`taxon_id` - not sure where this comes from... internal id?  
`taxon_id_pbdb` - [PBDB identifier](https://paleobiodb.org)  
`taxon_id_tol` - [Tree of Life identifier?](http://tolweb.org/tree/)  
`mindat_id` - [Mindat identifier](https://www.mindat.org)  
`cetaf_identifier` - [CETAF registry institution identifier](http://collections.naturalsciences.be/cpb/nh-collections)  

___________
### 2. CoL
____________

4.4m records at time of testing. Each checklist and name now has its own DOI, which is pretty groovy.  

API auth is via GBIF accounts, which is useful and an interesting bit of existing connectivity. CoL also exposes both the original submitted species data and the ColDP ['interpreted' version](https://www.catalogueoflife.org/about/colpipeline#standard-processing). 

- [API docs](https://api.catalogueoflife.org/) - Uses [OpenAPI specifications](https://swagger.io/specification/). Documented using Swagger. Query/response object schemas are defined, but there's not much description of the purpose of each endpoint past input params and 'Try it out' - probably still working on it, as the new API is still under development.  
- [API user mailing list](https://lists.gbif.org/pipermail/col-users/) - nice intro/set of examples for the new API in [this post](https://lists.gbif.org/pipermail/col-users/2021-March/000004.html)  
- [Overview of the new API architecture](https://www.catalogueoflife.org/about/colpipeline)  
- [Catalogue Of Life Data Package](https://github.com/CatalogueOfLife/coldp#format-comparison): Not sure how relevant it is, but apparently it's a new data upload/download format, based on [frictionless data](https://frictionlessdata.io/) principles  
- [Web app](https://www.catalogueoflife.org)  


#### Endpoints

So. Many. Endpoints. Life is much too short to go through them all, the main resources/paths are summarised below. Note that the dataset key to get the most recent version of the CoL database is `3LR` (Swagger doesn't like string values apparently, so current version id at time of writing as int: `2349`, example older version for testing: `2242`)


In [29]:
# Basic params for the dataset endpoint
col_base = "https://api.catalogueoflife.org"
col_dataset = "/dataset"
col_checklist_2021 = '3LR'
col_checklist_2020 = 2242
col_taxon_key = '5QQJ9'

`/dataset`: 
* Operations relating to CoL datasets (aka. checklist and all the underlying/contributory datasets), inc. versioning info.   
* Can look at dataset-level data
* Or search for individual taxa within e.g., the main namelist 
* Combining dataset and name IDs brings back records that include object-level versioning info; see below for an example.

In [30]:
# Example call for a name within the main checklist dataset. Note the separate timestamps for last-modified in the taxon 
# record, the name record, the synonym list, and the names within the synonym list.  
col_r = requests.get(col_base + col_dataset + f"/{col_checklist_2021}/taxon/{col_taxon_key}/info")
col_r.json()

{'taxon': {'created': '2020-07-18T09:28:18.006855',
  'createdBy': 103,
  'modified': '2020-07-18T09:28:18.006855',
  'modifiedBy': 103,
  'datasetKey': 2349,
  'id': '5QQJ9',
  'sectorKey': 523,
  'name': {'created': '2020-07-18T09:28:18.006855',
   'createdBy': 103,
   'modified': '2020-09-30T03:30:12.626669',
   'modifiedBy': 103,
   'datasetKey': 2349,
   'id': 'aaaff60d-63da-434e-88cc-22d52dc3b45e',
   'sectorKey': 523,
   'homotypicNameId': 'aaaff60d-63da-434e-88cc-22d52dc3b45e',
   'scientificName': 'Picea abies var. abies',
   'rank': 'variety',
   'genus': 'Picea',
   'specificEpithet': 'abies',
   'infraspecificEpithet': 'abies',
   'code': 'botanical',
   'origin': 'source',
   'type': 'scientific',
   'parsed': True},
  'status': 'accepted',
  'origin': 'source',
  'parentId': '4HPZF',
  'remarks': 'IUCN status: LC',
  'scrutinizer': 'Farjon A.',
  'scrutinizerDate': '2014-01-31',
  'extinct': False,
  'environments': ['terrestrial'],
  'labelHtml': '<i>Picea abies</i> var.
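
To pull those separate last-modified timestamps out without scrolling through the whole response, something like the recursive walk below might do. It's a sketch: `collect_timestamps` is a hypothetical helper, and the sample dict is a cut-down copy of the response above rather than a live call.

```python
def collect_timestamps(obj, key='modified', path=''):
    """Recursively yield (path, value) for every `key` in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            child = f'{path}.{k}' if path else k
            if k == key:
                yield child, v
            else:
                yield from collect_timestamps(v, key, child)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from collect_timestamps(item, key, f'{path}[{i}]')

# Cut-down version of the taxon info response shown above
sample_info = {
    'taxon': {
        'modified': '2020-07-18T09:28:18.006855',
        'name': {'modified': '2020-09-30T03:30:12.626669'},
    }
}
modified_fields = list(collect_timestamps(sample_info))
```

With the real response, `list(collect_timestamps(col_r.json()))` should list every `modified` value alongside where it sits in the record.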

* Some summary counts are available for import operations (example below) - assume there's more sophisticated logging on the admin side of things, but could watch this for crude indicator of changes/updates to the checklist dataset (assuming that the main events of interest from CoL are based around name designations)  
* Weirdly hard to find previous keys for the same dataset though? Unless I'm missing something...

In [32]:
# Example calls for the import records associated with two different versions of the CoL checklist
col_r_2021 = requests.get(col_base + col_dataset + f"/{col_checklist_2021}/import").json()
col_r_2020 = requests.get(col_base + col_dataset + f"/{col_checklist_2020}/import").json()

# Earlier version
col_r_2020

[{'datasetKey': 3,
  'attempt': 23,
  'job': 'ProjectRelease',
  'state': 'finished',
  'started': '2020-12-01T15:26:12.145442',
  'finished': '2020-12-01T17:56:50.036638',
  'createdBy': 101,
  'bareNameCount': 325,
  'distributionCount': 2038375,
  'estimateCount': 1257,
  'mediaCount': 0,
  'nameCount': 3982070,
  'referenceCount': 2257009,
  'synonymCount': 1721671,
  'taxonCount': 2260074,
  'treatmentCount': 0,
  'typeMaterialCount': 0,
  'vernacularCount': 457308,
  'distributionsByGazetteerCount': {'text': 2038368, 'iso': 7},
  'extinctTaxaByRankCount': {'species': 38297,
   'infraspecific name': 539,
   'genus': 175,
   'subspecies': 73,
   'variety': 41,
   'subfamily': 13,
   'family': 9,
   'form': 6,
   'tribe': 5},
  'namesByCodeCount': {'zoological': 2333655,
   'botanical': 1465114,
   'bacterial': 15360,
   'virus': 8009},
  'namesByRankCount': {'species': 3306125,
   'genus': 185541,
   'subspecies': 182974,
   'variety': 171912,
   'infraspecific name': 62776,
   'fo

In [26]:
# Later version
col_r_2021

[{'datasetKey': 3,
  'attempt': 66,
  'job': 'ProjectRelease',
  'state': 'finished',
  'started': '2021-10-18T18:42:48.624667',
  'finished': '2021-10-19T00:46:20.432851',
  'createdBy': 102,
  'bareNameCount': 1361,
  'distributionCount': 1936307,
  'estimateCount': 1258,
  'mediaCount': 0,
  'nameCount': 4418386,
  'referenceCount': 2483978,
  'synonymCount': 1985107,
  'taxonCount': 2431918,
  'treatmentCount': 0,
  'typeMaterialCount': 0,
  'vernacularCount': 471815,
  'distributionsByGazetteerCount': {'text': 1536590,
   'mrgid': 393537,
   'iso': 6180},
  'extinctTaxaByRankCount': {'species': 112761,
   'genus': 13376,
   'subspecies': 3897,
   'variety': 2633,
   'family': 1380,
   'subgenus': 492,
   'subfamily': 451,
   'superfamily': 185,
   'form': 172,
   'tribe': 62,
   'order': 48,
   'suborder': 34,
   'subvariety': 12,
   'infraorder': 7,
   'subclass': 6,
   'class': 4,
   'infraclass': 4,
   'subtribe': 4,
   'supertribe': 3,
   'infraspecific name': 2,
   'superorde

* `/dataset/{dataset1}/diff/{diff2}` - can't get into this one (auth fail), but judging from the name, does it do an anti-union of two datasets? If this worked with most recent version of CoL checklist against an earlier version, that could be useful for clocking name changes.
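
In the meantime, the import summaries shown above for the two releases can be compared directly as a crude change indicator. A minimal sketch, using a hypothetical helper `count_deltas` and a handful of the counts copied from the two outputs above (the `/import` response is a list, so with live data you'd pass in e.g. `col_r_2020[0]` and `col_r_2021[0]`):

```python
def count_deltas(old, new):
    """Return {key: new - old} for integer *Count fields that differ.

    Nested breakdowns (e.g. namesByRankCount) are dicts, so they are skipped.
    """
    deltas = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key, 0), new.get(key, 0)
        is_simple_count = (key.endswith('Count')
                           and isinstance(a, int) and isinstance(b, int))
        if is_simple_count and a != b:
            deltas[key] = b - a
    return deltas

# Subset of the 2020 and 2021 import summaries shown above
summary_2020 = {'nameCount': 3982070, 'taxonCount': 2260074,
                'synonymCount': 1721671, 'mediaCount': 0}
summary_2021 = {'nameCount': 4418386, 'taxonCount': 2431918,
                'synonymCount': 1985107, 'mediaCount': 0}

release_deltas = count_deltas(summary_2020, summary_2021)
```

This only says how much the totals moved between releases, not which names changed, so it's no substitute for a working diff.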

#### Summary of other endpoints

`/importer`: ops for managing/monitoring uploads. Has some calls that seem to duplicate content from `/dataset`, so suspect this one is more for collaborators to manage their data.  

`/version`: does what it says on the tin - think it's the api version, rather than the data.  

`/admin`: Can't auth into these, for fairly obvious reasons. One or two of the GET ops look like summary counts/stats for the underlying data, which could be helpful for diffing.  

`/export`: ops for managing/monitoring downloads  

`/image`: Not sure what this is - looks like it might be more for embedding maps/trees in external sites?  

`/name`: only service under here is name matching, same as the GBIF service I think.  

`/nameusage`: looks like it's for references to taxa names, though I can't tell if it's the 'this species was first described in...' nomenclatural prov-type info, or 'this checklist dataset is derived from this research paper...'  

`/ndix`: woman_shrugging_emoji  

`/openapi`: Overview of paths and components  

`/parser`: Taxonomic name parsing?  

`/user`: User info (GBIF auth) with last logged in timestamp etc - would be useful with some of the 'updatedBy' values in other resources.  

`/vernacular`: Returns vernacular names that match a string  

`/vocab`: Enums and term lists for vocabs.  

___________
### 3. NSIDR
____________

[Github repo](https://github.com/DiSSCo/nsidr)  
[Web app](https://nsidr.org/)    

... there doesn't seem to be anything here? There are refs to documentation on the app's landing page, but no links - unless that specifically refers to the Cordra docs? Some calls are buildable by using the GUI + grabbing the URL.