Define and implement the GRSciColl master data management solution #319

Closed
ManonGros opened this issue Feb 25, 2021 · 11 comments
Labels
GRSciColl Issues related to institutions, collections and staff

ManonGros commented Feb 25, 2021

There are potentially multiple sources of truth for the metadata in the catalogue, and the differences between them need to be resolved; this problem is known as master data management. For example, information may be available in a dataset metadata description, an existing GRSciColl entry and an Index Herbariorum record.

Define, implement and document the approach taken by the catalogue for handling differing views of metadata.

An approach could be as follows:

  • For each institution and collection entry in the catalogue, a single source of truth is identified for the key metadata (title, description etc). This may be one of:
  • The core metadata is never changed in GRSciColl for externally sourced entities, and edits must be applied in the system providing the master record.
    • The entries in GRSciColl may be enriched with the following fields:
      • Additional identifiers

ManonGros added the GRSciColl label Feb 25, 2021

ManonGros commented Aug 24, 2021

What we want:

  • Every GRSciColl entry, whether institution or collection, will have a source of truth (master record) with a type.
  • There will be four types (maybe more at some point): GRSciColl (meaning the entry is maintained in GRSciColl), GBIF Registry (the information comes from a dataset metadata or a publisher page), IH and CETAF.
  • The type will be associated with an identifier or some way to retrieve the information needed to update the record.
  • MachineTags will most likely be used to capture that information (source and type of source of information).
  • NB: Users won’t handle the MachineTags directly; we will need a wrapper.
  • It should be clear to users what is the source of truth.
  • The UI should allow editors, mediators, etc. to select a source of truth.
  • When the source of truth is chosen, the UI should show what can still be edited in the GRSciColl registry and what will be overwritten by future synchronization. Ideally, this information (which fields can or cannot be edited) will be captured in the backend.
  • We would presumably need a "Create a collection" based on a source. Something along the lines of "create a collection using this dataset".
  • For datasets as sources of truth, dataset ingestion should trigger a GRSciColl update. When working on CETAF, we will need some sort of crawler.
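
The wish list above could be modelled roughly as follows. This is only a minimal sketch, not the registry's actual implementation: the `MasterSource` class, the machine tag layout, the `example.master-source` namespace and the contents of `LOCKED_FIELDS` are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceType(Enum):
    """The four source-of-truth types listed above."""
    GRSCICOLL = "GRSciColl"
    GBIF_REGISTRY = "GBIF Registry"
    IH = "IH"
    CETAF = "CETAF"


# Hypothetical: fields that a future synchronization would overwrite.
# The real lists would have to be decided and stored in the backend.
LOCKED_FIELDS = {
    SourceType.GBIF_REGISTRY: frozenset({"name", "description", "homepage"}),
}


@dataclass(frozen=True)
class MasterSource:
    source_type: SourceType
    # Identifier used to retrieve the master record (e.g. a dataset key).
    # None when the entry is maintained in GRSciColl itself.
    source_id: Optional[str] = None

    def to_machine_tag(self) -> dict:
        # Hypothetical tag layout; users would never see this directly.
        return {
            "namespace": "example.master-source",
            "name": self.source_type.value,
            "value": self.source_id or "",
        }

    def can_edit(self, field: str) -> bool:
        # A field stays editable unless the chosen source locks it.
        return field not in LOCKED_FIELDS.get(self.source_type, frozenset())
```

A UI wrapper could call `can_edit` for each form field to grey out the ones that synchronization would overwrite.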

What we don’t want or don’t need:

  • Right now IH synch generates new entries when a new institution is added to IH. We don’t want to do the same for the other sources of truth. We briefly discussed making creation suggestions but given the low number of records in CETAF, it would just be easier if someone manually created entries in GRSciColl (from CETAF).
  • No need to work on NCBI BioCollections synch for now. The most requested sources have been GBIF datasets/publisher and CETAF. We will focus on that.
  • Given how complicated mapping the sources to GRSciColl is (might require transformation), we cannot have a configuration mapping file. But it would be nice to have the mapping available or documented somewhere that I can check.

Where we start:

  • We will first only focus on GBIF datasets and publisher links. This will allow us to iron out the details in a system we know. CETAF will come later.
  • Tim and Marcos will figure out how to set up the backend for this to happen.


ManonGros commented Aug 24, 2021

Attempt at mapping fields:

| Collection fields | Dataset metadata fields |
| --- | --- |
| name | title |
| description | description |
| homepage | dataset DOI? homepage |
| catalogueURL | link to occurrences? |
| apiURL | link GBIF API call to occurrences? |
| preservationTypes | specimenPreservationMethod in collections (although we will need to map the terms too) |
| taxonomicCoverage | taxonomicCoverages (find a way to aggregate the data) or inferred from occurrences |
| geography | geographicCoverages (only the description part most likely) or inferred from occurrences |
| incorporatedCollections | name in collections |
| Active | default: True |
| identifier | identifier in collections + datasetDOI? |
| address | publishingOrganization address |
| city | publishingOrganization city |
| province | publishingOrganization province |
| postalCode | publishingOrganization postalCode |
| country | publishingOrganization country |
| contacts | contacts (we should probably refine the mapping here) |

NB: the Institution and Code, which are mandatory fields, cannot be inferred from the EML. The users will have to fill those fields. Perhaps we should also encourage the users to add a physical address? We could infer the address from the publisher as Marcos mentioned below.
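
To make the mapping concrete, here is a minimal sketch of the unambiguous rows of the table. It assumes the dataset and publisher are available as plain dictionaries keyed by the field names in the table; the function name `collection_from_dataset` and the `institutionKey` key are hypothetical. Rows still marked with a question mark (catalogueURL, apiURL, identifier) are left out, and the homepage row, which is still undecided, simply uses the dataset homepage here.

```python
def collection_from_dataset(dataset: dict, publisher: dict,
                            code: str, institution_key: str) -> dict:
    """Build a draft collection from dataset metadata (sketch only)."""
    # Institution and code cannot be inferred from the EML,
    # so the user has to supply them.
    if not code or not institution_key:
        raise ValueError("code and institution must be supplied by the user")

    collection = {
        "code": code,
        "institutionKey": institution_key,  # hypothetical key name
        "name": dataset.get("title"),
        "description": dataset.get("description"),
        "homepage": dataset.get("homepage"),
        "active": True,  # default from the table
    }
    # Address fields come from the publishing organization.
    for field in ("address", "city", "province", "postalCode", "country"):
        collection[field] = publisher.get(field)
    return collection
```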

| Institution fields | Organization fields |
| --- | --- |
| name | title |
| description | description |
| homepage | homepage |
| phone | phone |
| email | email |
| catalogueURL | link to occurrences? |
| apiURL | link GBIF API call to occurrences? |
| latitude | latitude |
| longitude | longitude |
| logoUrl | logoUrl |
| address | address |
| city | city |
| province | province |
| postalCode | postalCode |
| country | country |
| Active | default: True |
| contacts | contacts (we should probably refine the mapping here) |

NB: Same comment about codes as for collection.


MortenHofft commented Aug 25, 2021

collection homepage
Perhaps just use the dataset homepage (the field homepage)

collection identifiers
I'm not sure we can use the collections.identifiers for much. At least some curation would be needed. Below is a sample of how they are used.

{
  "key": "FMB",
  "doc_count": 46
},
{
  "key": "IAvH-CT",
  "doc_count": 37
},
{
  "key": "IAvH-A",
  "doc_count": 29
},
{
  "key": "IAvH-E",
  "doc_count": 24
},
{
  "key": "4ec2b246-f5fa-4b90-9a8d-ddafc2a3f970",
  "doc_count": 21
},
{
  "key": "Registro Nacional de Colecciones Biológicas: 207",
  "doc_count": 19
},
{
  "key": "Registro Nacional de Colecciones Biológicas: 3",
  "doc_count": 19
},
{
  "key": "IAvH-Am",
  "doc_count": 18
},
{
  "key": "IAvH-R",
  "doc_count": 18
},
{
  "key": "Registro Nacional de Colecciones Biológicas: 158",
  "doc_count": 17
}
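
As a rough illustration of the kind of curation this would need, the sample values could be sorted into buckets before deciding what to do with them. This is a sketch only: the three category names and the `classify_identifier` function are assumptions for illustration, not an agreed scheme.

```python
import uuid

# Prefix seen in the sample above.
REGISTRY_PREFIX = "Registro Nacional de Colecciones Biológicas:"


def classify_identifier(value: str) -> str:
    """Sort a raw collections.identifier value into a rough bucket."""
    value = value.strip()
    try:
        uuid.UUID(value)  # raises ValueError for anything that is not a UUID
        return "uuid"
    except ValueError:
        pass
    if value.startswith(REGISTRY_PREFIX):
        return "national_registry"
    # Everything else (e.g. "FMB", "IAvH-CT") looks like a collection code.
    return "code_like"
```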


MortenHofft commented Aug 25, 2021

Also should it perhaps be possible to map dataset => institution ?
E.g. https://www.gbif.org/dataset/288e1f4c-7c09-4604-ad19-920a61c55462 seems to be an institution. They talk about their collectionS in plural.

And they list their collections
https://api.gbif.org/v1/dataset/288e1f4c-7c09-4604-ad19-920a61c55462

UPDATE: in this case the publisher would be natural to use I guess. So perhaps no need after all :)
https://www.gbif.org/publisher/748bb006-8e16-4703-9936-8be1286aac30

@MortenHofft
Member

> taxonomicCoverage: taxonomicCoverages (find a way to aggregate the data)

Perhaps we could fall back to occurrence metrics when/if it isn't filled?
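
A minimal sketch of that fallback, assuming the dataset is a dictionary whose `taxonomicCoverages` entries each hold a `coverages` list with a `scientificName` (an assumption about the metadata layout). The `fetch_taxa_from_occurrences` callable is a hypothetical stand-in for whatever would query occurrence metrics.

```python
from typing import Callable


def taxonomic_coverage(dataset: dict,
                       fetch_taxa_from_occurrences: Callable[[str], list]) -> list:
    """Use the declared coverage if present, else fall back to occurrences."""
    names = []
    for block in dataset.get("taxonomicCoverages") or []:
        for coverage in block.get("coverages") or []:
            name = (coverage.get("scientificName") or "").strip()
            # Aggregate across blocks, keeping order and dropping duplicates.
            if name and name not in names:
                names.append(name)
    if names:
        return names
    # Nothing declared in the metadata: infer from the occurrences instead.
    return fetch_taxa_from_occurrences(dataset["key"])
```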

@marcos-lg
Contributor

For the collection-dataset mapping:

  • homepage -> I'd also use the dataset homepage
  • catalogueUrl and apiURL: shouldn't they point to the collection page and collection api instead of the occurrences?
  • maybe we can take the address from the publisher organization?

For the institution-organization mapping:

  • we could use the abbreviation field as code perhaps?
  • catalogueUrl and apiURL: same as for the other mapping

For both mappings, for the contacts I think we could check if the person exists in GRSciColl and create a new person otherwise. It's not ideal since we'd be more or less duplicating people. If the person changes in the organization or the dataset, should we update it in GRSciColl too? And if it's deleted, do we still keep this person in GRSciColl? With the current model that we have for persons, I don't think there is a good solution unless we improve the model first.
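
The check-then-create idea could look roughly like this. It is a sketch under a strong assumption: that a person can be recognised by a case-insensitive email match, which the thread has not agreed on. The function name and the dictionary keys are hypothetical.

```python
def find_or_create_person(contact: dict, persons: list) -> tuple:
    """Return (person, created) for a dataset or organization contact."""
    email = (contact.get("email") or "").strip().lower()
    if email:
        for person in persons:
            if (person.get("email") or "").strip().lower() == email:
                # Reuse the existing GRSciColl person.
                return person, False
    # No match (or no email to match on): create a new person.
    person = {
        "firstName": contact.get("firstName"),
        "lastName": contact.get("lastName"),
        "email": contact.get("email"),
    }
    persons.append(person)
    return person, True
```

Note that this does nothing about later updates or deletions in the source, which is exactly the open question raised above.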

@ManonGros
Contributor Author

The problem with inferring a collection's parent institution from a dataset title (or publisher) is that it might generate duplicates if the spellings are different than what we have in GRSciColl. Plus, what if there are several institutions in GRSciColl matching the same name? I think someone will have to check manually which institution should be the parent one, it cannot really happen automatically.
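
To illustrate why this needs a human in the loop, here is a small sketch that only proposes candidates. The normalization rules (case, accents, whitespace) and the function names are assumptions for illustration; even a single match would still be a suggestion for an editor to confirm.

```python
import unicodedata


def normalize(name: str) -> str:
    """Fold case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.casefold().split())


def candidate_institutions(title: str, institutions: list) -> tuple:
    """Return (status, candidates) for a dataset or publisher title."""
    target = normalize(title)
    matches = [i for i in institutions if normalize(i["name"]) == target]
    if not matches:
        status = "none"       # spelling may simply differ from GRSciColl
    elif len(matches) == 1:
        status = "single"     # still to be confirmed manually
    else:
        status = "ambiguous"  # several institutions share the name
    return status, matches
```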

Concerning using occurrences and publisher to infer some content:
It depends on whether we will be using the EML only for synch or the published dataset (which will have the publisher).

  • Using the EML only would mean that we can have people link data from IPTs that aren't necessarily published on GBIF (like OBIS for example).
  • But using the GBIF dataset would probably be easier. Plus, we could infer:
    • the taxonomic and geographic coverages from the occurrences
    • the address from the publisher address

I don't think the abbreviation field is part of the "Become a publisher" form, so I doubt it will be filled very often. We probably cannot count on it very much.

I agree that we should first check if the contact exists in GRSciColl before creating a new one.
We should at least be able to update changes in contacts ("this person is now in charge of that" type of changes). Ideally, we should probably also update changes in email addresses, phones, etc. But I know that many datasets have the same contacts, I can imagine some conflicts if the person is not updated everywhere. What would be possible?

The definition of the catalogueURL field we wrote is "If your specimens are digitized and available online, you can put here the link to access them".
For the apiURL, it is "Same as Catalogue URL, if your institution exposes its records via an API (relevant mainly for iDigBio entries)."
That's why I was suggesting to put the links to occurrences. Does it make sense? We can also leave those fields empty.

@marcos-lg
Contributor

> But I know that many datasets have the same contacts, I can imagine some conflicts if the person is not updated everywhere. What would be possible?

Yes, that can happen. This complicates things. If we don't want to have conflicts, we'd have to "duplicate" all the contacts and keep a link between them so we know for sure which GRSciColl person they refer to.

> The definition of the catalogueURL field we wrote is "If your specimens are digitized and available online, you can put here the link to access them".
> For the apiURL, it is "Same as Catalogue URL, if your institution exposes its records via an API (relevant mainly for iDigBio entries)."
> That's why I was suggesting to put the links to occurrences. Does it make sense? We can also leave those fields empty.

I'm not sure. I guess some collections might have records in multiple datasets. We could have a link to the occurrences in the institution/collection page.

@ManonGros
Contributor Author

> I'm not sure. I guess some collections might have records in multiple datasets. We could have a link to the occurrences in the institution/collection page.

You are right, it gets a bit complicated. I think we should leave those empty by default and the users can always fill them in.


marcos-lg commented Oct 18, 2021

As agreed with the others, we'll map the specimenPreservationMethod in the collections field of the dataset to the preservationTypes of a GRSciColl collection like this:

| specimenPreservationMethod | preservationTypes |
| --- | --- |
| NO_TREATMENT | empty |
| ALCOHOL | SAMPLE_FLUID_PRESERVED |
| DEEP_FROZEN | STORAGE_FROZEN_BETWEEN_MINUS_132_AND_MINUS_196 |
| DRIED | SAMPLE_DRIED |
| DRIED_AND_PRESSED | SAMPLE_DRIED, SAMPLE_PRESSED |
| FORMALIN | SAMPLE_FLUID_PRESERVED |
| REFRIGERATED | STORAGE_REFRIGERATED |
| FREEZE_DRIED | SAMPLE_FREEZE_DRYING |
| GLYCERIN | SAMPLE_FLUID_PRESERVED |
| GUM_ARABIC | SAMPLE_FLUID_PRESERVED |
| MICROSCOPIC_PREPARATION | SAMPLE_SLIDE_MOUNT |
| MOUNTED | SAMPLE_OTHER |
| PINNED | SAMPLE_PINNED |
| OTHER | STORAGE_OTHER |
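
Expressed as a lookup, the table above could look like the sketch below. The dictionary mirrors the table; the function name `map_preservation` and the choice to silently skip unknown values are assumptions for illustration, not a description of the deployed code.

```python
PRESERVATION_MAP = {
    "NO_TREATMENT": [],
    "ALCOHOL": ["SAMPLE_FLUID_PRESERVED"],
    "DEEP_FROZEN": ["STORAGE_FROZEN_BETWEEN_MINUS_132_AND_MINUS_196"],
    "DRIED": ["SAMPLE_DRIED"],
    "DRIED_AND_PRESSED": ["SAMPLE_DRIED", "SAMPLE_PRESSED"],
    "FORMALIN": ["SAMPLE_FLUID_PRESERVED"],
    "REFRIGERATED": ["STORAGE_REFRIGERATED"],
    "FREEZE_DRIED": ["SAMPLE_FREEZE_DRYING"],
    "GLYCERIN": ["SAMPLE_FLUID_PRESERVED"],
    "GUM_ARABIC": ["SAMPLE_FLUID_PRESERVED"],
    "MICROSCOPIC_PREPARATION": ["SAMPLE_SLIDE_MOUNT"],
    "MOUNTED": ["SAMPLE_OTHER"],
    "PINNED": ["SAMPLE_PINNED"],
    "OTHER": ["STORAGE_OTHER"],
}


def map_preservation(methods: list) -> list:
    """Map dataset preservation methods to collection preservation types."""
    result = []
    for method in methods:
        # Assumption: values missing from the table are skipped.
        for preservation_type in PRESERVATION_MAP.get(method, []):
            if preservation_type not in result:  # several methods share a type
                result.append(preservation_type)
    return result
```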

@marcos-lg
Contributor

Deployed to PROD.
