TypeStatus - curation before uploading first vocabulary version #87

Closed
ManonGros opened this issue Mar 18, 2021 · 45 comments
Labels: content, occurrence, priority:high

Comments

@ManonGros
Collaborator

Here is a file to edit: https://drive.google.com/file/d/1WOgxAt3nIL2TVpXV06Qa8wNTY9Q7R5R7/view?usp=sharing

It contains:

  • the list of existing concepts
  • a list of the values already mapped to the concepts (they are all in the Hidden sheet/tab for now)
  • the GBIF verbatim values for this field that appear more than 10,000 times or in 5 or more datasets
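
The selection rule for verbatim values can be sketched as below. This is only an illustration of the two thresholds mentioned above; the `VerbatimStat` record and its field names are hypothetical and do not reflect how the sheet was actually produced.

```python
from dataclasses import dataclass

# Thresholds taken from the description above.
MIN_OCCURRENCES = 10_000  # "more than 10,000 times"
MIN_DATASETS = 5          # "in 5 or more datasets"


@dataclass(frozen=True)
class VerbatimStat:
    """Hypothetical summary row for one verbatim typeStatus value."""
    value: str
    occurrence_count: int
    dataset_count: int


def needs_review(stat: VerbatimStat) -> bool:
    """True if the value is frequent enough, or widespread enough, to review."""
    return (
        stat.occurrence_count > MIN_OCCURRENCES
        or stat.dataset_count >= MIN_DATASETS
    )


def select_for_review(stats: list[VerbatimStat]) -> list[str]:
    """Keep the verbatim values that pass either threshold, in input order."""
    return [s.value for s in stats if needs_review(s)]
```

Note that the first threshold is strict (more than 10,000) while the second is inclusive (5 or more), following the wording of the list.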

Please check the instructions here: #70

@ManonGros added the content label Mar 18, 2021
@ahahn-gbif

I'd like to work on this one, please

@ahahn-gbif

Some intermediate notes:

  • added ISOEPITYPE
  • check whether to add these as new concepts (legitimate?) or map them to the parent (TYPE): TYPESTRAIN; TYPESERIES; METATYPE; HOMOTYPE (??); TYPEMATERIAL; PARATOPOTYPE; NEOPARATYPE; LECTOPARATYPE; CLONOTYPE (not really a type)
  • UNKNOWN: discarded again. If we do not know whether it is a type or not, it should be treated the same as any other NULL value, not labeled specifically just because the field has some other content
  • NOTATYPE: specifically used for previously mislabeled specimens, to signify that the type status was marked incorrectly
  • handling of tentative values (?, "possibly") = we do not know. In the case of type status, it is still more important for a user to be able to find possible candidates and verify, so I do map this to the "straight" values. NB: this will likely be different for other vocabularies!
  • check "types" that are not nomenclatural or relate to names, not material (or unclear): CLONOTYPE; LOCOTYPE - remember this is also referenced in Checklistbank!
  • parent relationship: is the "parent" of an isosyntype the isotype (duplicate of...) or the syntype (one of a series...)? same for other combinations
     
    also see: http://gbif.github.io/parsers/apidocs/org/gbif/api/vocabulary/TypeStatus.html

@ahahn-gbif

This is at a stage where I would appreciate a review. Do we have a process for this, or whom should I include, @timrobertson100?

@timrobertson100
Member

timrobertson100 commented Apr 19, 2021

Thanks @ahahn-gbif

Each vocabulary can present different quirks, so the process needs to be tailored to fit, but in general I find these steps work:

  1. Consider if ALA or others may bring new needs (I'd assume unlikely here)
  2. Identify a good person to review it
  3. Introduce your methodology to them
  4. Ask them to review the relevant parts. In this case:
    - The concepts and definitions
    - The hidden labels
    - The verbatim sheet (some people don't complete this)
  5. Ask me or @marcos-lg to import it into UAT so the author can check that it looks as intended
  6. Announce availability on UAT here
  7. Import to production as soon as you feel confident no new info will come
    - currently, it is more tedious to change once in the production registry than repeating an import from a spreadsheet multiple times in UAT

In this instance you need domain-specific knowledge, so perhaps @mdoering would be a good reviewer?

@mdoering
Member

@ahahn-gbif happy to review, just ping me

@ahahn-gbif

Concerning the handling of verbatim data containing uncertainty markers ("type?", "possible type", etc.):

Policy decision to

  • serve the use case of letting data users know that a) there is a possible type and b) the assignation is uncertain. Assumption: a user searching explicitly for type material would rather examine a few uncertain cases than lose the information that they might exist. The uncertainty should still be indicated so that a publisher's data is not misrepresented (false accuracy).

For the vocabulary this means to

  • not map these verbatim combinations as hidden labels
  • leave it to the consumer (code) to handle verbatim values alternatively as exact matches or fuzzy match, and indicate fuzzy matching through an issue flag (or other suitable mechanism)

(after consultation with @mdoering, @timrobertson100 and @marcos-lg)
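
One way a consumer could implement this policy is sketched below: try an exact match against the hidden labels first, and only if that fails strip the uncertainty markers, retry, and report the match as uncertain. The `HIDDEN_LABELS` excerpt, the marker pattern, and the `Match` type are assumptions made for illustration, not the pipeline's actual code.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical excerpt of hidden labels -> concept names; the real mapping
# lives in the vocabulary spreadsheet, not here.
HIDDEN_LABELS = {
    "holotype": "HOLOTYPE",
    "paratype": "PARATYPE",
    "type": "TYPE",
}

# Assumed uncertainty markers, based on the examples "type?" and "possible type".
UNCERTAINTY = re.compile(r"\?|\b(?:possible|possibly)\b", re.IGNORECASE)


class Match(NamedTuple):
    concept: Optional[str]
    uncertain: bool  # True when the match only worked after removing markers


def normalize(value: str) -> str:
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def match_type_status(verbatim: str) -> Match:
    key = normalize(verbatim)
    if key in HIDDEN_LABELS:
        # Exact match: no issue flag needed.
        return Match(HIDDEN_LABELS[key], False)
    stripped = normalize(UNCERTAINTY.sub(" ", verbatim))
    if stripped != key and stripped in HIDDEN_LABELS:
        # Fuzzy match: the caller should raise an issue flag.
        return Match(HIDDEN_LABELS[stripped], True)
    return Match(None, False)
```

Keeping the uncertain variants out of the hidden labels means the vocabulary itself stays clean, while the `uncertain` field gives the consumer a hook for the issue flag.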

@mdoering
Member

  • NOTATYPE: specifically used for previously mislabeled specimens, to signify that the type status was marked incorrectly

I would recommend renaming this to NOT_A_TYPE so it's clear. With all these cryptic type names, NOTATYPE sounds like a reasonable name for some kind of type.

@mdoering
Member

parent relationship: is the "parent" of an isosyntype the isotype (duplicate of...) or the syntype (one of a series...)? same for other combinations

If you have to have a single parent, it should be the syntype: iso- just indicates a duplicate, and the syntype is the more distinguished type status, I would say.

@mdoering
Member

Is it worth considering an OTHER status to populate in case the verbatim status cannot be parsed but is not null?

@mdoering
Member

should ISOTYPE not have holotype as the parent instead of just type?

An isotype is any duplicate of the holotype

@mdoering
Member

None of the isotypeXYZ concepts should have ISOTYPE as the parent. ISOTYPE is based on the holotype; all the others are not.

  • ISOEPITYPE -> EPITYPE
  • ISOLECTOTYPE -> LECTOTYPE
  • ISONEOTYPE -> NEOTYPE
  • ISOPARATYPE -> PARATYPE
  • ISOSYNTYPE -> SYNTYPE
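
The proposed parent links can be written down as a simple child-to-parent map, which makes it easy to check what a concept rolls up to. This is a minimal sketch: the five iso- links come from the list above, while the links from the second level up to TYPE are an assumption for illustration and not the final vocabulary.

```python
# Child -> parent. The iso- links follow the proposal above; the links up to
# TYPE are assumed here purely to make the example complete.
PARENT = {
    "ISOEPITYPE": "EPITYPE",
    "ISOLECTOTYPE": "LECTOTYPE",
    "ISONEOTYPE": "NEOTYPE",
    "ISOPARATYPE": "PARATYPE",
    "ISOSYNTYPE": "SYNTYPE",
    "EPITYPE": "TYPE",
    "LECTOTYPE": "TYPE",
    "NEOTYPE": "TYPE",
    "PARATYPE": "TYPE",
    "SYNTYPE": "TYPE",
}


def ancestors(concept: str) -> list[str]:
    """Return the chain of parents, from the nearest parent up to the root."""
    chain = []
    seen = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        if concept in seen:
            # Guard against an accidental cycle in the spreadsheet.
            raise ValueError(f"cycle detected at {concept}")
        seen.add(concept)
        chain.append(concept)
    return chain
```

With these links, ISOTYPE never appears in the ancestor chain of ISOLECTOTYPE or the other iso- concepts, which is the point of the proposal.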

@mdoering
Member

The COL type status vocabulary sets up a hierarchy by specifying a "base" status (= parent). Maybe that's worth looking into:
http://api.catalogueoflife.org/vocab/typestatus

@mdoering
Member

I would map the following rather to NULL, maybe flagging an uncertainty issue.

ORIGINALMATERIAL | POSSIBLETYPE
ORIGINALMATERIAL | TYPESTATUSUNKNOWN

@mdoering
Member

mdoering commented Apr 26, 2021

I would create a distinct new value for Neoparatype & Lectoparatype.
You often find their name also reversed as Paraneotype & Paralectotype. Ah, it is in the vocabulary like this already!

TYPE | Neoparatype
TYPE | neoparatype
TYPE | Lectoparatype
TYPE | lectoparatype

@mdoering
Member

Paratopotype is mapped to PARATYPE and TYPE. Should be the former for all 3:

PARATYPE | PARATOPOTYPE
PARATYPE | Paratopotype
TYPE | paratopotype
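
Inconsistencies like this one, where case variants of the same label point to different concepts, could be caught with a small check over the hidden-label sheet. The sketch below assumes the sheet can be read as (concept, hidden label) pairs and that labels should be compared ignoring case and surrounding whitespace; both are assumptions, not documented behaviour of the import.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return label.strip().casefold()


def conflicting_labels(rows: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each normalized label to its concepts, keeping only the conflicts.

    `rows` are (concept, hidden_label) pairs, as in the listing above.
    """
    concepts = defaultdict(set)
    for concept, label in rows:
        concepts[normalize(label)].add(concept)
    return {label: found for label, found in concepts.items() if len(found) > 1}


rows = [
    ("PARATYPE", "PARATOPOTYPE"),
    ("PARATYPE", "Paratopotype"),
    ("TYPE", "paratopotype"),
]
# Reports "paratopotype" as mapped to both PARATYPE and TYPE.
print(conflicting_labels(rows))
```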

@ahahn-gbif

ahahn-gbif commented Aug 2, 2022

should ISOTYPE not have holotype as the parent instead of just type?

I am undecided here between the formal relationship between terms (there is a group of isotypes that gets further characterised as exisotype, plastoisotype etc), and relationships between objects (if there is an isotype, there also must be a holotype that the isotype relates to).

I think the vocabulary models the former (grab anything that is an isotype out of the bucket), but not necessarily the latter (if I search for holotypes, would I expect to also get isotypes delivered?) - but I may be confused.

Edit: I see that the CoL typestatus vocabulary (link above) does make holotype the parent of isotype. The purpose/use case may still be different there than our occurrence relevance (?)
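
The practical difference between the two options can be illustrated with a small sketch. It assumes that a search for a concept also returns everything below it in the hierarchy; whether the occurrence search actually behaves this way is an assumption here, and the two parent maps are reduced to the three concepts under discussion.

```python
from collections import defaultdict


def descendants(parent_of: dict[str, str], concept: str) -> set[str]:
    """All concepts below `concept` in a single-parent hierarchy."""
    children = defaultdict(set)
    for child, parent in parent_of.items():
        children[parent].add(child)
    found, stack = set(), [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def expand_query(parent_of: dict[str, str], concept: str) -> set[str]:
    """Assumed search behaviour: a concept matches itself and all descendants."""
    return {concept} | descendants(parent_of, concept)


# Option A: ISOTYPE is a sibling of HOLOTYPE under TYPE.
SIBLINGS = {"HOLOTYPE": "TYPE", "ISOTYPE": "TYPE"}
# Option B: ISOTYPE is a child of HOLOTYPE (as in the CoL vocabulary linked above).
NESTED = {"HOLOTYPE": "TYPE", "ISOTYPE": "HOLOTYPE"}
```

Under option A, `expand_query(SIBLINGS, "HOLOTYPE")` contains only HOLOTYPE; under option B, `expand_query(NESTED, "HOLOTYPE")` also returns ISOTYPE, which is exactly the behaviour being questioned.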

@ahahn-gbif

Is it worth considering an OTHER status to populate in case the verbatim status cannot be parsed but is not null?

Also undecided on this one. I would introduce this only if we would expect other type-related terms that cannot be mapped to any of the existing ones, and where introducing a new concept would be wrong for some reason. Do we want an option for parsing type-unrelated content, and if so, for what purpose? Or would it be a temporary bucket for collecting content that is either mis-mapped or omitted from the vocabulary in error?

@ahahn-gbif

I would map the following rather to NULL, maybe flagging an uncertainty issue.

ORIGINALMATERIAL | POSSIBLETYPE
ORIGINALMATERIAL | TYPESTATUSUNKNOWN

I would agree about "TYPESTATUSUNKNOWN". For POSSIBLETYPE, the ORIGINALMATERIAL definition as "'type-suspicious' material" seems to fit rather well, so I am less sure here. I removed one, but kept the other for now.

@mdoering
Member

mdoering commented Aug 2, 2022

Edit: I see that the CoL typestatus vocabulary (link above) does make holotype the parent of isotype. The purpose/use case may still be different there than our occurrence relevance (?)

COL might do it wrongly. I think you are right. An isotype is a duplicate of a holotype, but not itself a holotype.
I will change that in COL...

@mdoering
Member

mdoering commented Aug 2, 2022

Hm. But ISOEPITYPE, ISOLECTOTYPE or ISONEOTYPE are not subclasses of ISOTYPE - they are all duplicates, but not from the holotype

@ahahn-gbif

Thanks, I missed that. Just as ISOTYPE is a sibling of HOLOTYPE and child of TYPE, this would also apply to the other SYN~ names. Following the same logic, they would all be subclasses of TYPE, rather.

@ahahn-gbif

Actually, the same would apply to the Allo~ (opposite sex of the holotype) and Para~ (everything but the holotype, though used in the description) series (?)

@mdoering
Member

mdoering commented Aug 4, 2022

Indeed. Difficult to model. There are several unrelated properties/flags really that taxonomists have combined into a single word

@ahahn-gbif

For handover due to intermediate absence:
Mostly done, and ok for the existing hidden values. Possibly to still consider, or for later addition if relevant:

CLONOTYPE | (botany) Herbarium specimens made from plants vegetatively propagated from (thus clones of) the same plant from which a type specimen was made. Clonotypes are of some use in documenting a type collection but have no status under the International Code of Botanical Nomenclature. The term is sometimes also used to refer to the living plants themselves.
METATYPE | a topotype or homeotype determined by the original author of its species
HOMEOTYPE | a biological specimen that has been carefully compared with and identified with an original or primary type
TYPESERIES | a group of representatives of a taxon (as a subspecies or species) selected to demonstrate the extent of variation of that unit

Otherwise, I think this could be imported now and edited later, unless there are any remaining concerns?

@CecSve
Collaborator

CecSve commented Apr 15, 2024

@ahahn-gbif do you think it would make sense to include the nomenclatural code, e.g. botanical, zoological etc., as a tag (http://api.catalogueoflife.org/vocab/typestatus)? It would allow us and editors to manage the vocabulary based on the code applied.

I will follow up and add #87 (comment) before I upload to UAT.

I realise we have the following decision for uncertain values: #87 (comment) - could we create a pipelines issue for this flag so we can discuss how best to implement it @marcos-lg?

@CecSve
Collaborator

CecSve commented Apr 15, 2024

@ahahn-gbif If I understand correctly, all syn-, para- and allo- concepts should be children of Type? We do not have a way to link siblings, so I will leave that to the description of the concepts.

Concept | Parent
ALLEOTYPE | LECTOTYPE
ALLONEOTYPE | NEOTYPE
ALLOTYPE | TYPE

@marcos-lg
Contributor

I realise we have the following decision for uncertain values: #87 (comment) - could we create a pipelines issue for this flag so we can discuss how best to implement it @marcos-lg?

I'd prefer not to add fuzzy matching, so as not to overcomplicate things. I'd leave them unmapped and flag them somehow. Maybe we can flag them if we recognize some keywords in the verbatim value, such as "possibly" or "?".

@ahahn-gbif

ahahn-gbif commented Apr 15, 2024

syn-, para- and allo- concepts should be children of Type?

We had revised the parent-child relationships; remaining errors excepted, the parent should be what is listed in the parent column, meaning this is hierarchical, and a parent (like Type) can have multiple children. This does not mean that Type is the parent for all - the relationship from C:Alloneotype to P:Neotype is correct. Is this going to be a problem?

@ahahn-gbif

(...) to include the nomenclatural code (...)

I am not sure we have that information. It would not make sense in many cases, as the type status is not a particular botanical or zoological flavor. Where we do know that a type status only exists in a single code, maybe, but I am not knowledgeable enough to know where this would be the case.

@CecSve
Collaborator

CecSve commented May 10, 2024

(...) to include the nomenclatural code (...)

I am not sure we have that information. It would not make sense in many cases, as the type status is not a particular botanical or zoological flavor. Where we do know that a type status only exists in a single code, maybe, but I am not knowledgeable enough to know where this would be the case.

Ok, we can leave it for now and add it later, if it makes sense.

@CecSve
Collaborator

CecSve commented May 10, 2024

syn-, para- and allo- concepts should be children of Type?

We had revised the parent-child relationships; remaining errors excepted, the parent should be what is listed in the parent column, meaning this is hierarchical, and a parent (like Type) can have multiple children. This does not mean that Type is the parent for all - the relationship from C:Alloneotype to P:Neotype is correct. Is this going to be a problem?

No, this should not be a problem. I will stick with this setup then.

@CecSve
Collaborator

CecSve commented May 10, 2024

I realise we have the following decision for uncertain values: #87 (comment) - could we create a pipelines issue for this flag so we can discuss how best to implement it @marcos-lg?

I'd prefer not to add fuzzy matching, so as not to overcomplicate things. I'd leave them unmapped and flag them somehow. Maybe we can flag them if we recognize some keywords in the verbatim value, such as "possibly" or "?".

Yes, I will leave them unmapped and comment again once I have compiled all the values that should be flagged.

@CecSve
Collaborator

CecSve commented May 10, 2024

All isotypeXYZ should not have ISOTYPE as the parent. ISOTYPE is based on the holotype, all the others not.

  • ISOEPITYPE -> EPITYPE
  • ISOLECTOTYPE -> LECTOTYPE
  • ISONEOTYPE -> NEOTYPE
  • ISOPARATYPE -> PARATYPE
  • ISOSYNTYPE -> SYNTYPE

@ahahn-gbif these are currently mapped to parent = type. Am I misunderstanding the comment, or should they rather be mapped as suggested above? Or are all hierarchical structures correct now, as stated here?

@ahahn-gbif

@CecSve they should indeed rather be mapped as Markus suggested on April 26, 2021 (your list above) - thanks!

@CecSve
Collaborator

CecSve commented May 10, 2024

This seems to be a wrong designation of the label_En?

Concept | Label_en
TypeSpecies | typus generis
TypeGenus | typus familiaris

@ahahn-gbif

a wrong designation of the label_En

Technically speaking, probably yes (since it is not English). I was adding this as the technical term for the concept, but maybe just "type species" or "type species (typus generis)" would be closer to the label intention

@CecSve
Collaborator

CecSve commented May 14, 2024

a wrong designation of the label_En

Technically speaking, probably yes (since it is not English). I was adding this as the technical term for the concept, but maybe just "type species" or "type species (typus generis)" would be closer to the label intention

I have added the typus *** to alternativeLabels_en and put type *** in Label_en instead - it probably does not really matter, but just for consistency

@CecSve
Collaborator

CecSve commented May 14, 2024

Changed concepts to UpperCamelCase and added alternativeLabels_en typus to Type. Also removing mappings to verbatim values with ? or possible.
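
For names that already carry word separators, the renaming to UpperCamelCase could be scripted roughly as below. This is a sketch only: the helper `to_upper_camel` is hypothetical, and nothing in this thread says the renaming was done by script rather than by hand.

```python
import re


def to_upper_camel(name: str) -> str:
    """Convert e.g. 'NOT_A_TYPE' to 'NotAType'.

    Word boundaries are taken from underscores, spaces, or hyphens only.
    A run-together name such as 'TYPESPECIES' has no boundary to detect and
    becomes 'Typespecies', so compound names still need manual review.
    """
    words = [w for w in re.split(r"[_\s-]+", name) if w]
    return "".join(w[0].upper() + w[1:].lower() for w in words)
```

The limitation described in the docstring is why concepts such as TypeSpecies and TypeGenus cannot be derived mechanically from an all-caps spelling without separators.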

@CecSve
Collaborator

CecSve commented May 15, 2024

For handover due to intermediate absence: Mostly done, and ok for the existing hidden values. Possibly to still consider, or for later addition if relevant:

CLONOTYPE | (botany) Herbarium specimens made from plants vegetatively propagated from (thus clones of) the same plant from which a type specimen was made. Clonotypes are of some use in documenting a type collection but have no status under the International Code of Botanical Nomenclature. The term is sometimes also used to refer to the living plants themselves.
METATYPE | a topotype or homeotype determined by the original author of its species
HOMEOTYPE | a biological specimen that has been carefully compared with and identified with an original or primary type
TYPESERIES | a group of representatives of a taxon (as a subspecies or species) selected to demonstrate the extent of variation of that unit

Otherwise, I think this could be imported now and edited later, unless there are any remaining concerns?

Added the suggested concepts.

Should I add Paratopotype? It is in our verbatim values and has the following Wiki definition:

(biology) A paratype found in the same locality as the holotype

Also suggested here.

@ahahn-gbif

Should I add Paratopotype?

Sounds reasonable to add. Parent Paratype.
There is no particular reason that it was left out, quite likely an oversight

@CecSve
Collaborator

CecSve commented May 15, 2024

Should I add Paratopotype?

Sounds reasonable to add. Parent Paratype. There is no particular reason that it was left out, quite likely an oversight

Thanks - I will also add Lectoparatype then. It is already in as Paralectotype, so I have added it as an alternative label.

@CecSve
Collaborator

CecSve commented May 24, 2024

The vocabulary is now uploaded to prod and UAT.

@CecSve closed this as completed May 24, 2024
@CecSve
Collaborator

CecSve commented Jun 10, 2024

@marcos-lg the following values are part of unmapped verbatim value strings and should be parsed and flagged during interpretation:

?
Possible
Possibly
Potential
Maybe
Perhaps

should the flag be type status uncertain? Other ideas @ahahn-gbif?

@ahahn-gbif

should the flag be type status uncertain?

Maybe suspected type? This may be nitpicky, but type status uncertain would suggest to me that the uncertainty is about the "type of type" rather than the question whether or not it is a type at all.

@CecSve
Collaborator

CecSve commented Jun 10, 2024

should the flag be type status uncertain?

Maybe suspected type? This may be nitpicky, but type status uncertain would suggest to me that the uncertainty is about the "type of type" rather than the question whether or not it is a type at all.

Thanks that makes sense. suspected type it should be then.
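
A minimal sketch of how the agreed behaviour might look during interpretation: values that the vocabulary lookup does not map are scanned for the listed markers and, if one is present, flagged as a suspected type. The flag identifier `SUSPECTED_TYPE`, the `interpret` function, and the `lookup` callable are hypothetical names; the real pipeline code and flag name may differ.

```python
import re

# Markers listed in this thread; matching is assumed to be case-insensitive.
MARKER_WORDS = ("possible", "possibly", "potential", "maybe", "perhaps")
MARKER_PATTERN = re.compile(
    r"\?|\b(?:" + "|".join(MARKER_WORDS) + r")\b", re.IGNORECASE
)

SUSPECTED_TYPE = "SUSPECTED_TYPE"  # hypothetical flag identifier


def interpret(verbatim, lookup):
    """Return (concept, flags) for one verbatim typeStatus value.

    `lookup` is any callable mapping a verbatim string to a concept name or
    None; it stands in for the vocabulary lookup, which is not shown here.
    """
    if verbatim is None or not verbatim.strip():
        return None, []
    concept = lookup(verbatim)
    flags = []
    # Only unmapped values are inspected: uncertain strings stay unmapped.
    if concept is None and MARKER_PATTERN.search(verbatim):
        flags.append(SUSPECTED_TYPE)
    return concept, flags
```

Because the check runs only when the lookup returns nothing, a cleanly mapped value is never flagged, and an unmapped value without any marker stays unflagged.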
