Identifier contextualization #291

jmcmurry · 2017-09-07T00:13:29Z

Our curation on the PrefixCommons sub has uncovered several types of datasets whose identifiers are not resolvable at their original source. There is nothing DataMed can do about that except perhaps exhort the original sources to make better design decisions.

However, what is tricky is that DataMed is reusing some of the original (local) identifiers, but as far as I can tell, often without any contextualization or any way to find the thing identified. For instance, in the record below, how does one access what this record corresponds to? What is 3120, and at what level of granularity? After 30 minutes of searching the native site I still can't work out who issued that ID or where I can find it. If the access point for most GTEx data is actually dbGaP, it would be best to direct users there and to give the dbGaP ID, for example phs000424.v3.p1. If you're going to use an identifier that is not (uniquely) resolvable, it would be good to at least link to where and how to find it.

Below is a markup of the record in bioCADDIE; some drive-by observations are there too (unrelated to identifiers per se).

Some thoughts on potential approaches to identifier surrogacy are now here: http://bit.ly/identifiersurrogacy
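To make the request above concrete, here is a minimal sketch of what a contextualized identifier could carry alongside the bare local ID. The class name `ContextualizedId` and all of its fields are hypothetical, chosen only for illustration; they are not part of DataMed or any existing schema, and the access hint text is a placeholder rather than a real resolver URL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextualizedId:
    """Hypothetical record: a local ID plus the context needed to find it."""
    local_id: str                      # the identifier as issued by the source
    issuer: str                        # who minted the identifier
    granularity: str                   # what kind of thing it identifies
    access_hint: Optional[str] = None  # where / how to find it, if known

    def describe(self) -> str:
        where = self.access_hint or "no access point recorded"
        return f"{self.issuer}:{self.local_id} ({self.granularity}); find it via: {where}"


# Example using the dbGaP accession mentioned above (access hint is a placeholder).
gtex = ContextualizedId(
    local_id="phs000424.v3.p1",
    issuer="dbGaP",
    granularity="study",
    access_hint="search dbGaP for the accession",
)
print(gtex.describe())
```

Even when an ID cannot be resolved to a URL, a record shaped like this would answer the three questions raised above: who issued it, what it identifies, and where to look.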

jmcmurry · 2017-09-07T04:45:09Z

Here's another example:

proccaserra · 2017-09-07T15:49:10Z

@jmcmurry I had a look at the DataMed ingester pipeline for Metabolomics Workbench and there are clearly errors (hard-coded values for the publication elements). As for the identifier, the number shown, '58', seems to correspond to the NIH MW Study_ID, but it should read ST000058, not '58'.
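As a rough sketch of the fix for the identifier, the ingester could restore the full form rather than emit the bare number. This assumes, based only on the single example above, that the Study_ID is "ST" followed by a six-digit zero-padded number; the function name `format_mw_study_id` is hypothetical and not taken from the DataMed code.

```python
def format_mw_study_id(local_id) -> str:
    """Return the full Study_ID form, e.g. 58 -> 'ST000058'.

    Assumes the 'ST' + six zero-padded digits pattern seen in ST000058.
    """
    digits = str(local_id).strip()
    # Accept an ID that already carries the prefix.
    if digits.upper().startswith("ST"):
        digits = digits[2:]
    if not digits.isdigit():
        raise ValueError(f"not a numeric study ID: {local_id!r}")
    return f"ST{int(digits):06d}"


print(format_mw_study_id(58))
```

The function is idempotent, so running it over records that already hold the full form leaves them unchanged.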

ianfore · 2017-09-15T13:38:52Z

For GTEx it's not clear what's been indexed: there are 1622 "datasets" indexed from GTEx, and they all have the same title and description. First we need to work out what these represent and what we should consider a dataset for GTEx.

jmcmurry · 2017-09-15T16:31:10Z

Exactly, DataMed or any similar platform really has to think carefully about:

the granularity at which to index records
the metadata that best distinguishes those records from each other
how to surface that metadata in a unified way

For instance, the publication above for 58 (i.e., ST000058) is not related to the specific record at all, but to the Metabolomics Workbench as a whole. Thus, while the publication is relevant, it would be better placed in the record for the database than in the individual record.

ianfore added the Transformation label Sep 15, 2017

zongnansu1982 added the Data Ingestion Process label Sep 25, 2017
