Our curation on the PrefixCommons sub has uncovered several types of datasets whose identifiers are not resolvable at their original source. There is nothing DataMed can do about that except perhaps exhort the original sources to make better design decisions.
However, what is tricky is that DataMed is reusing some of the original (local) identifiers but, as far as I can tell, often without any contextualization or any way to find the thing identified. For instance, in the record below, how does one access what this record corresponds to? What is 3120, and at what level of granularity? After 30 minutes of searching the native site I still can't work out who issued that ID or where I can find it. If the access point for most GTEx data is actually dbGaP, it would be best to direct users there and to give the dbGaP ID, for example phs000424.v3.p1. If you're going to use an identifier that is not (uniquely) resolvable, it would be good to at least link to where or how to find it.
Below is a markup of the record in bioCADDIE; it also includes some drive-by observations (unrelated to identifiers per se).
@jmcmurry I had a look at the DataMed ingester pipeline for Metabolomics Workbench and there are clearly errors (hard-coded values for the publication elements). As for the identifier, the number shown, '58', seems to correspond to the NIH MW Study_ID, but it should read ST000058, not '58'.
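To illustrate the identifier fix, here is a minimal sketch of how an ingester could restore the full study ID from the truncated number. It assumes that Metabolomics Workbench study IDs are always `ST` followed by a six-digit, zero-padded number, which is inferred only from the single example ST000058; the function name `normalize_mw_study_id` is hypothetical and not part of the DataMed pipeline.

```python
import re


def normalize_mw_study_id(raw):
    """Return a study ID in the form 'ST000058'.

    Assumption: IDs are 'ST' plus a six-digit, zero-padded number
    (inferred from the example ST000058, not from a specification).
    """
    text = str(raw).strip().upper()
    # Accept either the bare number ('58') or an already prefixed ID ('ST000058').
    match = re.fullmatch(r"(?:ST)?(\d{1,6})", text)
    if match is None:
        raise ValueError(f"not a recognizable study ID: {raw!r}")
    return f"ST{int(match.group(1)):06d}"


print(normalize_mw_study_id("58"))        # ST000058
print(normalize_mw_study_id("ST000058"))  # ST000058
```

Rejecting anything that does not match, rather than passing it through, would at least make broken identifiers visible at ingest time instead of in the index.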
For GTEx it's not clear what has been indexed: there are 1622 "datasets" indexed from GTEx, and they all have the same title and description. First we need to work out what these represent and what we should consider a dataset for GTEx.
Exactly, DataMed or any similar platform really has to think carefully about:
- the granularity at which to index records
- the metadata that best distinguishes those records from each other
- how to surface that metadata in a unified way
For instance, the publication shown above for 58 (i.e. ST000058) is not related to that specific record at all, but to the Metabolomics Workbench as a whole. Thus, while the publication is relevant, it would be better placed in the record for the database than in the individual record.
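One way to picture that separation is the rough sketch below. It is not DataMed's actual data model: the class names, field names, and the prefix `mw.study` are all hypothetical, and the placeholder strings stand in for values I have not looked up. The idea is that database-level metadata (such as the publication) lives on the source, while each record carries its local identifier, the prefix that gives it context, and a pointer to where it can be found.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    """The database as a whole (hypothetical model)."""
    name: str
    prefix: str  # gives context to otherwise bare local identifiers
    publications: List[str] = field(default_factory=list)  # database-level, not per record


@dataclass
class DatasetRecord:
    """One indexed record within a source (hypothetical model)."""
    source: Source
    local_id: str
    title: str
    access_url: Optional[str] = None  # where / how to find the identified thing

    def curie(self) -> str:
        # Prefix plus local ID, so the identifier is never shown without context.
        return f"{self.source.prefix}:{self.local_id}"

    def is_findable(self) -> bool:
        # A record with no access point should be flagged, not silently indexed.
        return self.access_url is not None


mw = Source(
    name="Metabolomics Workbench",
    prefix="mw.study",  # hypothetical prefix, for illustration only
    publications=["<database-level publication>"],
)
record = DatasetRecord(source=mw, local_id="ST000058", title="<study title>")
print(record.curie())        # mw.study:ST000058
print(record.is_findable())  # False
```

With a shape like this, 1622 records that share one title and description would still differ in `local_id`, and any record lacking an access point could be listed for curation.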
Some thoughts on potential approaches to identifier surrogacy are now here: http://bit.ly/identifiersurrogacy