Identifier contextualization #291

Open
jmcmurry opened this issue Sep 7, 2017 · 4 comments

Comments

@jmcmurry

jmcmurry commented Sep 7, 2017

Our curation on the PrefixCommons sub has uncovered several types of datasets whose identifiers are not resolvable at their original source. There is nothing DataMed can do about that except perhaps exhort the original sources to make better design decisions.

However, what is tricky is that DataMed is reusing some of the original (local) identifiers but, as far as I can tell, often without any contextualization or any way to find the thing that is identified. For instance, in the record below, how does one access what it corresponds to? What is 3120, and at what level of granularity? After 30 minutes of searching the native site I still can't work out who issued that ID or where I can find it. If the access point for most GTEx data is actually dbGaP, it would be best to direct users there and to give the dbGaP ID, for example phs000424.v3.p1. If you're going to use an identifier that is not (uniquely) resolvable, it would be good to at least link to where or how to find it.

Below is a markup of the record in bioCADDIE; some drive-by observations (unrelated to identifiers per se) are there too.

[Image: biocaddie_id_ambiguity]

Some thoughts on potential approaches to identifier surrogacy are now here: http://bit.ly/identifiersurrogacy
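To make the request above concrete, here is a minimal sketch of what a contextualized identifier could carry alongside the bare local ID. The class name `ContextualizedId` and all of its fields are hypothetical illustrations, not part of DataMed's schema; the dbGaP accession is the one quoted in this comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextualizedId:
    """A local identifier plus the context needed to interpret it (hypothetical)."""
    local_id: str                       # identifier as issued by the source, e.g. "3120"
    issuer: str                         # who minted the identifier
    entity_type: str                    # granularity: "study", "sample", "file", ...
    resolver_url: Optional[str] = None  # direct link, if the ID is resolvable
    how_to_find: Optional[str] = None   # fallback instructions when it is not

    def is_locatable(self) -> bool:
        """True if a reader has some way to reach the identified thing."""
        return bool(self.resolver_url or self.how_to_find)


# The bare record from the screenshot: no issuer, no granularity, no pointer.
bare = ContextualizedId(local_id="3120", issuer="unknown", entity_type="unknown")

# The same kind of record with context attached.
described = ContextualizedId(
    local_id="phs000424.v3.p1",
    issuer="dbGaP",
    entity_type="study",
    how_to_find="Search dbGaP for this accession",
)

print(bare.is_locatable(), described.is_locatable())
```

An ingester could refuse, or at least flag, any record for which `is_locatable()` is false.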

@jmcmurry
Author

jmcmurry commented Sep 7, 2017

Here's another example:
[Image: biocaddie_id_ambiguity2]

@proccaserra

@jmcmurry I had a look at the DataMed ingester pipeline for Metabolomics Workbench and there are clearly errors (hard-coded values for the publication elements). As for the identifier, the number shown, '58', seems to correspond to the NIH MW Study_ID, but it should read ST000058, not '58'.
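As a rough illustration of the fix on the ingester side, the sketch below restores the full accession from a truncated number. The function name `format_mw_study_id` is hypothetical, and the "ST" prefix with six zero-padded digits is an assumption inferred only from the single example ST000058 above.

```python
import re


def format_mw_study_id(raw) -> str:
    """Return a full study accession such as 'ST000058' from '58', 58 or 'ST000058'.

    Assumption: accessions are 'ST' followed by six zero-padded digits.
    """
    text = str(raw).strip().upper()
    match = re.fullmatch(r"(?:ST)?(\d{1,6})", text)
    if match is None:
        raise ValueError(f"not a recognizable study ID: {raw!r}")
    return f"ST{int(match.group(1)):06d}"


print(format_mw_study_id(58))
print(format_mw_study_id("ST000058"))
```

Raising on anything unrecognized, rather than passing it through, keeps a malformed ID from being indexed silently.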

@ianfore
Contributor

ianfore commented Sep 15, 2017

For GTEx it's not clear what's been indexed: there are 1622 "datasets" indexed from GTEx, and they all have the same title and description. First we need to work out what these represent and what we should consider a dataset for GTEx.
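One way to surface this kind of problem automatically is to group indexed records by the fields a user sees and report any group with more than one member. This is only a sketch: the function name, the record layout (`id`, `title`, `description` keys) and the sample records are all hypothetical and are not taken from the DataMed index.

```python
from collections import defaultdict


def find_indistinguishable(records, keys=("title", "description")):
    """Map each shared (title, description) signature to the IDs that share it."""
    groups = defaultdict(list)
    for record in records:
        signature = tuple(record.get(key) for key in keys)
        groups[signature].append(record["id"])
    # Keep only signatures that fail to tell records apart.
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}


# Hypothetical sample records, for illustration only.
sample = [
    {"id": "1", "title": "Project A", "description": "Same text"},
    {"id": "2", "title": "Project A", "description": "Same text"},
    {"id": "3", "title": "Project B", "description": "Other text"},
]

clashes = find_indistinguishable(sample)
print(clashes)
```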

@jmcmurry
Author

Exactly. DataMed, or any similar platform, really has to think carefully about:

  • the granularity at which to index records
  • the metadata that best distinguishes those records from each other
  • how to surface that metadata in a unified way

For instance, the publication above for 58 (i.e. ST000058) is not related to the specific record at all, but to the Metabolomics Workbench as a whole. Thus, while the publication is relevant, it would be better placed in the record for the database than in the individual record.
