Create list of identifiers and their canonical forms #51

Open
1 task done
amoeba opened this issue Sep 2, 2015 · 22 comments

@amoeba
Contributor

amoeba commented Sep 2, 2015

This issue stems from discussion on the Sep 2 2015 teleconference.

The literal representation of identifiers can come into our graphs in multiple forms, e.g.

We would like to have a canonical form to simplify lives for both producers and consumers.

  • Create a list of identifiers and their canonical forms as a file somewhere in the repo tree (e.g. DOI, ARK, etc)
@amoeba amoeba self-assigned this Sep 2, 2015
amoeba added a commit that referenced this issue Sep 9, 2015
@amoeba
Contributor Author

amoeba commented Sep 14, 2015

I've made a lot of good progress on this but could use (1) some more work to track down information on some of the remaining identifiers and (2) some input on my current set of recommendations.

See https://github.com/ec-geolink/design/blob/master/data/dataone/canonical-identifiers.md

My current recommended form for each identifier is preceded by the text 'Recommend:'

@krisnadhi
Contributor

Thanks @amoeba!

Regarding issue #61, do you think it would be appropriate to create an IdentifierScheme class in the base ontology and generate instances for all of those identifier schemes? That is, we would create a separate OWL file containing something like the following:
<http://schema.geolink.org/dev/voc/identifier/scheme#doi> rdf:type gl:IdentifierScheme .
<http://schema.geolink.org/dev/voc/identifier/scheme#doi> rdfs:seeAlso <http://doi.org> .

@bob-arko

Remind me: since DataCite has already published URIs for these terms, e.g.
http://purl.org/spar/datacite/isni
http://purl.org/spar/datacite/ark
http://purl.org/spar/datacite/doi
...
why can't we use these?

(Apologies if this has already been answered.)




@amoeba
Contributor Author

amoeba commented Sep 14, 2015

@robertarko You responded just as I was typing this up:

Interesting idea. Could you maybe explain what doing that would add to our efforts? I think I like the idea.

The identifiers I'm researching the serializations for are all in the DataCite ontology as NamedIndividuals. For identifiers in our graphs that have schemes that already exist as NI's in DataCite, I would prefer to use DataCite. But I expect we will have identifiers not in DataCite and possibly not in another ontology so we will need to do something like you've suggested to accommodate them.

@bob-arko

Right, I see your point. I was assuming the DataCite vocabulary
is comprehensive.

Out of curiosity, what are some ID types you need that
DataCite doesn't have?


@amoeba
Contributor Author

amoeba commented Sep 14, 2015

I don't think there are any in the DataONE network. @mbjones might know otherwise, but I think identifiers in DataONE will always map directly to DataCite. Many of ours are DataCite local-resource-identifiers.

@bob-arko

Okay.
So maybe we can just adopt DataCite's vocabulary outright?
(i.e., no need to create a new set of classes in schema.geolink.org.)
Since DataCite has generic fallback options for ID types
like "local-resource-identifier-scheme" and "url", that's
probably everything we need.


@mbjones
Contributor

mbjones commented Sep 14, 2015

I have yet to find one that we've needed in DataONE that isn't already in the DataCite vocabulary (especially since they support url and urn identifiers as types).

@krisnadhi
Contributor

I didn't know the extent of the DataCite vocabulary for identifier schemes, hence my earlier comment. If an appropriate scheme is already available from DataCite, I am also in favor of using it instead of inventing our own URI.

@robertarko, are IMA identifier schemes covered by anything from DataCite other than the generic fallback options?

Referring to #61, hasIdentifierScheme is proposed to be changed to an object property. For this purpose, I suggest adding an IdentifierScheme class, which would be aligned to datacite:IdentifierScheme. Identifier scheme URIs like http://purl.org/spar/datacite/ark would be instances of this class.
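
A minimal sketch of how the DataCite scheme URIs mentioned in this thread could be generated and typed. It assumes every scheme URI is the base http://purl.org/spar/datacite/ followed by a lowercase scheme name, which the thread only confirms for isni, ark, and doi. The helper names are hypothetical, and the gl: prefix is left unexpanded, exactly as in the snippet earlier in the thread.

```python
# Base namespace taken from the DataCite URIs quoted in this thread.
DATACITE_BASE = "http://purl.org/spar/datacite/"


def scheme_uri(scheme: str) -> str:
    """Build a scheme URI, assuming the pattern base + lowercase name."""
    return DATACITE_BASE + scheme.strip().lower()


def scheme_triple(scheme: str) -> str:
    """Render one Turtle-style statement typing the scheme URI.

    The rdf: and gl: prefixes are assumed to be declared elsewhere.
    """
    return f"<{scheme_uri(scheme)}> rdf:type gl:IdentifierScheme ."


if __name__ == "__main__":
    # Only the three schemes whose URIs appear in the thread are used here.
    for name in ("isni", "ark", "doi"):
        print(scheme_triple(name))
```

Whether the class should live in the gl: namespace or be replaced outright by datacite:IdentifierScheme is the open question here; the sketch only illustrates the instance-of relationship.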

@mbjones
Contributor

mbjones commented Sep 15, 2015

@krisnadhi Sounds good to me.

@amoeba and I just discussed being careful about the definitions of our properties. For example, we should clarify that there are two use cases for hasIdentifierValue: one to get the machine-readable URI for the Identifier, and one to get the display form. The URI version of an identifier can and should be used as the LOD URI for the Identifier instance itself, except when an anonymous Identifier node is to be used. In that case, does the hasIdentifierValue property contain a literal showing the properly formatted syntax for displaying the identifier (e.g., "doi:10.xxxx/foo42") or the machine-readable URI for the identifier (e.g., http://doi.org/10.xxxx/foo42)? And, in the latter case, where does a client application find the display form for the Identifier? Maybe we need another property such as hasIdentifierDisplayValue.
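
To make the two use cases concrete, here is a rough sketch of deriving both forms of a DOI from whatever spelling arrives. The function names are hypothetical, the list of accepted prefixes is an assumption (including the dx.doi.org host), and the choice of http://doi.org/ as the link base simply mirrors the example in this comment.

```python
import re

# Prefixes a DOI might arrive with; illustrative, not exhaustive.
_DOI_PREFIXES = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE
)


def bare_doi(raw: str) -> str:
    """Strip a known leading prefix, leaving the '10.xxxx/suffix' part."""
    return _DOI_PREFIXES.sub("", raw.strip())


def doi_display_form(raw: str) -> str:
    """Human-readable form, e.g. 'doi:10.xxxx/foo42'."""
    return "doi:" + bare_doi(raw)


def doi_link_form(raw: str) -> str:
    """Machine-readable, resolvable form, e.g. 'http://doi.org/10.xxxx/foo42'."""
    return "http://doi.org/" + bare_doi(raw)
```

If both forms can always be derived like this for a given scheme, one stored value may be enough; a second property only becomes necessary for schemes where no such rule exists.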

@krisnadhi
Contributor

Why can't we use the value pointed to by the hasIdentifierValue property for the display form of the identifier?

@mbjones
Contributor

mbjones commented Sep 15, 2015

@krisnadhi We could, and that's how I originally thought of it. But, will everyone be using it that way? Would we be happy with the definition of the hasIdentifierValue property as "Provides a string literal value that represents the proper form of the identifier for display purposes."? So, for a DOI, this would be of the form "doi:10.xxxx/foo42".

@sparkji
Contributor

sparkji commented Sep 15, 2015

Hi Bryce,

Could you also add 'GVP', 'SCAR', 'InterRidge', 'IMA' and 'IGSN' to the
canonical-identifiers list?

GVP
Smithsonian's Global Volcanism Program (GVP) announces new and permanent
unique identifiers (Volcano Numbers, or VNums) for volcanoes documented in
the Volcanoes of the World (VOTW) database maintained by GVP and accessible
at www.volcano.si.edu.

Source:

http://volcano.si.edu/list_volcano_holocene.cfm

Examples:

GVP:210010
http://volcano.si.edu/volcano.cfm?vn=210010

SCAR
The Scientific Committee on Antarctic Research (SCAR), through its
recommendations, expresses the hope that the present effort will contribute
to the adoption in Antarctica of the general principle of 'one name per
feature' by all Antarctic place naming authorities.
Source:

https://www1.data.antarctica.gov.au/aadc/gaz/scar/information.cfm

Examples:

SCAR:883

Notes:

It does not publish URIs that speak RDF

InterRidge
The InterRidge Global Database of Active Submarine Hydrothermal Vent
Fields, hereafter referred to as the “InterRidge Vents Database,” is to
provide a comprehensive list of active and inferred active (unconfirmed)
submarine hydrothermal vent fields for use in academic research and
education.

Source:

http://vents-data.interridge.org/about_the_database

Examples:

InterRidge:13-n-ridge-site
http://vents-data.interridge.org/ventfield/13-n-ridge-site

Notes:

It speaks RDF as of version 3 and provides a SPARQL endpoint:
http://vents-data.interridge.org/sparql

IMA
The International Mineralogical Association (IMA) publishes a list containing
names and data for minerals that have been approved, discredited,
redefined, or renamed. It is the new revised master list of all
IMA-approved and grandfathered (i.e. inherited from before 1960) minerals.
Source:

http://www.ima-mineralogy.org/Minlist.htm

Examples:

IMA:2014-028

Notes:

It does not publish the URIs that speak RDF

IGSN
IGSN stands for International Geo Sample Number. The IGSN is a 9-digit
alphanumeric code that uniquely identifies samples from our natural
environment and related sampling features. You can get an IGSN for your
sample by registering it in the System for Earth Sample Registration (SESAR).
Source:

http://www.geosamples.org/igsnabout

Examples:

IGSN:HRV003M16


@amoeba
Contributor Author

amoeba commented Sep 15, 2015

@sparkji I'll add those to the list today. Thanks for providing all that information too -- it helps a lot!

@krisnadhi
Contributor

@krisnadhi We could, and that's how I originally thought of it. But, will everyone be using it that way? Would we be happy with the definition of the hasIdentifierValue property as "Provides a string literal value that represents the proper form of the identifier for display purposes."? So, for a DOI, this would be of the form "doi:10.xxxx/foo42".

Would that be the only purpose of the value pointed to by hasIdentifierValue property? If that were the case, then it would be better to use rdfs:label and simply drop hasIdentifierValue property. IMHO, hasIdentifierValue implicitly captures our intention that the value it points to is really the identifier value and hence, consumers can assume that typical characteristics of identifiers hold, e.g., uniqueness of the value in the context of the identifier scheme. Obviously, the same value can still be used for display purposes.

One way to avoid confusion regarding how to display the identifier value is to augment the corresponding instance of Identifier class with an rdfs:label annotation whereby the label literal value is copied from the value pointed to by the hasIdentifierValue property.

@bob-arko

If we represent DOIs as doi:10.xxxx/foo, then will we follow that
style consistently? I.e., ISNIs (for organizations) would be isni:xyz,
ORCIDs (for persons) would be orcid:xyz, IGSNs (for samples)
would be igsn:xyz, etc.?


@bob-arko

PS. Coincidentally we're discussing similar issues in
the EarthCube workshop this week.

One thing that worries me is how publishers will implement
identifiers in journal articles. If they follow the
DataCite approach (scheme and value), then they may implement
business logic that always/automatically prepends the
scheme to the value. So we'll end up with DOIs that look
like

doi:doi:10.xxxx/foo42
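
One way to guard against this on the producing side is to make the prefixing step idempotent. The sketch below is only an illustration: with_scheme_prefix is a hypothetical helper, and it assumes a scheme's values never legitimately begin with the scheme name followed by a colon.

```python
def with_scheme_prefix(scheme: str, value: str) -> str:
    """Return 'scheme:value', adding the prefix only if it is missing."""
    prefix = scheme.lower() + ":"
    stripped = value.strip()
    # Drop every leading copy of the prefix (case-insensitively) so that
    # already-prefixed or doubly-prefixed input collapses to one prefix.
    while stripped.lower().startswith(prefix):
        stripped = stripped[len(prefix):]
    return prefix + stripped
```

Applying the function to its own output changes nothing, so it is safe to run at more than one stage of a publishing pipeline.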


@krisnadhi
Contributor

If we represent DOIs as doi:10.xxxx/foo, then will we follow that
style consistently? ie. ISNIs (for organizations) would be isni:xyz,
ORCIDs (for persons) would be orcid:xyz, IGSNs (for samples)
would be igsn:xyz ,etc ?

Actually, @amoeba's note already indicates that this style is not necessarily appropriate for some identifier schemes.

One thing that worries me, is how Publishers will implement
identifiers in journal articles. If they follow the
DataCite approach (scheme and value), then they may implement
business logic that always/automatically prepends the
scheme to the value. So we'll end up with DOIs that look
like

doi:doi:10.xxxx/foo42

Is this business logic more on the data publishing side or the data consumption side?
If this is about the data publishing side, wouldn't it be reasonable to assume that data publishers would ensure their data are nicely formatted, e.g., that they wouldn't publish a DOI literal with two doi prefixes? So, we are okay as long as we have a set of recommended canonical forms that data publishers should use when publishing within the GeoLink framework.
If this is more about the data consumption side, then I think we shouldn't worry too much about how the business logic in the data-consuming application is implemented, as long as we use consistent styles when pushing out the data via the GeoLink public endpoint.

amoeba added a commit that referenced this issue Sep 16, 2015
Added GVP, SCAR, InterRidge, IMA, IGSN as per request by @sparkji in
#51.
@mbjones
Contributor

mbjones commented Sep 16, 2015

I agree with @krisnadhi that consumers need to consume identifiers intelligently, because there are so many ways of representing things and the recommended best practice for how to reference identifiers is a moving target. Plus, some groups like the DOI foundation make both a display recommendation (DOI:10.xxxx/foo) and a machine-readable link recommendation (e.g., http://dx.doi.org/10.xxxx/foo, http://doi.org/10.xxxx/foo over time). I think the issue here is that we need to know where these two types of information (display and link) will be recorded in glbase, and which is which. I'm not enamored of rdfs:label because it is used so loosely and sometimes contains garbage text. I would prefer targeted properties for identifierDisplayForm and identifierLinkForm, regardless of the naming we end up with.

@amoeba
Contributor Author

amoeba commented Oct 15, 2015

Just checking in on this issue as I don't think we've resolved it just yet.

From the comments, it looks like we need to decide whether we want the display form, the machine-readable form, or a web-resolvable form (or some combination of the three) to be stored in our graphs, and how we want to do that. We could use rdfs:label for the display form and glview:hasIdentifierValue for the machine-readable form, but we might want to create a new property like glview:hasIdentifierDisplayForm, as @mbjones suggested. I'm pretty stuck on what to recommend.

Identifiers can have (1) a value, (2) a display form, and/or (3) an HTTP resolvable form, with some of these forms being the same for some identifiers. What should we be storing in our graphs?
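
As a starting point for discussion, the three forms could be modeled roughly like this. It is only a sketch: the class and field names are hypothetical and are not proposed property names, and treating a missing form as "same as the value" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdentifierForms:
    """One identifier with up to three recorded forms."""

    scheme: str
    value: str
    display_form: Optional[str] = None      # None: same as value
    resolvable_form: Optional[str] = None   # None: same as value

    def display(self) -> str:
        """Fall back to the raw value when no display form is recorded."""
        return self.display_form if self.display_form is not None else self.value

    def distinct_forms(self) -> set:
        """Collect the forms that are actually recorded, without repeats."""
        forms = (self.value, self.display_form, self.resolvable_form)
        return {form for form in forms if form is not None}
```

A DOI would fill all three fields with different strings, while a plain URL identifier would only need the value, which is the case where some of the forms coincide.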

@amoeba
Contributor Author

amoeba commented Nov 20, 2015

As per our 2015/11/18 telecon, I will complete a first draft of the identifier recommendations for the group to review. I'll have this done for the next telecon on 2015/12/2.
