reference and referencesets should allow for multiple names. #518

diekhans · 2015-12-16T22:16:30Z

Sequence data often has multiple equivalent names. It would be very useful for GA4GH to represent and query on multiple different names.

For instance, GRCh38.p2 is also GCF_000001405.28.
Chromosome 1 has the names 1, CM000663.2, NC_000001.11, and chr1.

david4096 · 2016-01-05T19:36:42Z

There are a few different ways to approach this in my mind, but first let me outline the current state of reference and referencesets.

References have a required name and an optional array of sourceAccessions. Reference Sets have an optional name, an optional assemblyId, and an array of sourceAccessions. Both search endpoints have the ability to name an accession to search on, which is a case-sensitive exact match.

The SearchReferencesRequest document allows for searching using accession but not name. If searching references by name is a desired use case, let's make sure that makes it in there.

Can searching within sourceAccessions serve this feature?

If not, we might make the name field an array so that a single record can be found using more than one search query. (q=chr1 or q=1) This would leave the request and response records more or less intact.

Alternatively, we might benefit from using ExternalIdentifiers throughout the system. This is the view of a software person, not a biologist, though! Even if sourceAccessions could serve the purpose for references and reference sets, the issue of multiple names also arises elsewhere, for example when naming variants across multiple databases. Using ExternalIdentifier would allow the same search semantics to exist across all the record types that may need to be identified in multiple ways.

record ExternalIdentifier {
  /**
  The source of the identifier.
  (e.g. `Ensembl`)
  */
  string database;

  /**
  The ID defined by the external database.
  (e.g. `ENST00000000000`)
  */
  string identifier;

  /**
  The version of the object or the database
  (e.g. `78`)
  */
  string version;
}
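
As a rough illustration of how an array of such identifiers could back a search, here is a minimal Python sketch. The `Reference` class, its `external_ids` field, the `search_by_identifier` helper, the internal ids, and the split of an accession such as NC_000001.11 into `identifier` and `version` are all assumptions made for the example; none of them is part of the schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ExternalIdentifier:
    """Python mirror of the ExternalIdentifier record above."""
    database: str
    identifier: str
    version: str


@dataclass
class Reference:
    """Hypothetical reference carrying an array of external identifiers."""
    id: str
    name: str
    external_ids: List[ExternalIdentifier] = field(default_factory=list)


def search_by_identifier(references, database, identifier):
    """Return references holding an exact, case-sensitive identifier match."""
    return [
        ref for ref in references
        if any(x.database == database and x.identifier == identifier
               for x in ref.external_ids)
    ]


# Illustrative data only; the internal ids are made up.
refs = [
    Reference("ref-0001", "1", [
        ExternalIdentifier("GenBank", "CM000663", "2"),
        ExternalIdentifier("RefSeq", "NC_000001", "11"),
    ]),
    Reference("ref-0002", "2", []),
]

matches = search_by_identifier(refs, "RefSeq", "NC_000001")
print([ref.id for ref in matches])
```

The match is deliberately exact and case-sensitive, mirroring how the existing accession search is described earlier in this comment.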

macieksmuga · 2016-01-05T19:45:45Z

ExternalIdentifier is defined in sequenceAnnotations.avdl as part of the RNA seq PR. #517

diekhans · 2016-01-05T20:31:20Z

ExternalIdentifier is in common.avdl in master. I am a little unsure about the definition, since it's never been used.

mbaudis · 2016-01-05T20:59:17Z

My take on ExternalIdentifier:

yes, should be defined as consistent object; and the attributes look o.k. (though I would add an optional URI)
not sure about the "External" part in the name; your external identifier is my local one ... More like a fully qualified identifier or whatever it could be called etc.
this is one of those "helper" objects which need a common home, like OntologyTerm, GeographicLocation; moving those to common.avdl or something similar but new would make sense (spattering such objects into both common.avdl and metadata.avdl does not)

saupchurch · 2016-01-05T21:54:26Z

Turns out ExternalIdentifier was defined in both sequenceAnnotations.avdl and common.avdl in the rna branch. I just pushed an update removing it from sequenceAnnotations.avdl.

macieksmuga · 2016-01-06T04:57:29Z

Thanks Sean. I'm a bit disturbed that the Avro schema checker run during Travis CI (that helped us find a couple of other genuine schema errors) didn't pick this duplication up.

pgrosu · 2016-01-06T06:27:12Z

Maciek, I don't believe PR #517 is merged yet for the check to catch this, as only the common.avdl schema is merged.

diekhans · 2016-01-06T07:00:44Z

@macieksmuga the two definitions exist in different namespaces, so they do not collide.

pgrosu · 2016-01-06T17:21:46Z

Hmm...maybe I'm doing something funny, but when I check the two files, their namespaces seem to be the same:


$ head -n1 common.avdl
@namespace("org.ga4gh.models")

$ head -n1 sequenceAnnotations.avdl
@namespace("org.ga4gh.models")

$ echo `head -n1 common.avdl` `head -n1 sequenceAnnotations.avdl` | awk '{ print ($1 == $2) ? "true" : "false" }'
true

Which script and parameters are performing the checking?

~p

diekhans · 2016-01-06T18:18:00Z

ouch, they are in the same name space... another good reason to ditch avro

david4096 · 2016-04-26T17:12:16Z

Does the optional array of sourceAccessions fulfill this, or should we propose a schema change to add an ExternalIdentifier array to references?

mbaudis · 2016-04-26T19:12:17Z

@david4096 Just a warning that "accession" is a loaded base term, which should be defined first (I would go for properly prefixed internal or external id and/or URI). We started this before, but it slipped through the "don't define ahead without a use case" gap. Maybe there is one here.

diekhans · 2016-04-28T15:11:42Z

X vs chrX are not accessions, they are alternate names. A search by name needs to match either of these.

macieksmuga · 2016-04-28T16:10:09Z

@diekhans I'm trying to nail down this use case: Does a search by name such as "1" need to return a reference named "chr1", and vice versa? Is this (along with "chr2", ... "chrX", "chrY") the ONLY heuristic to build in to a name search, or are there others that we should be aware of?

macieksmuga · 2016-04-28T16:14:16Z

Also, are searches by accessions for individual references still a valid use case? If so, I'm not sure where I'd find these accessions in the data: Given a single multi-record FASTA file of GRCh37, here's a typical header for one of the chromosomes:

>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1

What part of that, if any, constitutes an accession?

richarddurbin · 2016-04-28T17:14:55Z

No. A search by name “1” should only return the reference with name “1”. There should be at most one such reference in a referenceSet.
To get something with name “chr1” you need to search for “chr1”.

It is fine to start from the position that we have now that people know for a given dataset whether it uses references with the “1” or “chr1”
convention. If one doesn’t know it is easy to try both. The thing that no user would know is what the internal id is for chromosome 1 - there
clearly has to be a way to find that.

Richard

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

diekhans · 2016-04-28T18:27:13Z

there is no need for heuristic name mapping, which can be unpredictable. The names are precisely defined as part of the assembly release:
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.28.assembly.txt

the accession is not contained in the string:

3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
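
To make "names defined by the assembly release" concrete, here is a minimal Python sketch that collects every name for a sequence from text laid out like an NCBI assembly report. The column positions, the embedded sample row, and the `names_from_report` helper are assumptions for illustration; check them against the real report before relying on them.

```python
import csv
import io

# Sample text in the assumed layout of an assembly report: '#' comment
# lines followed by tab-separated rows. This is illustrative, not a
# download of the real file.
SAMPLE_REPORT = (
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n"
)


def names_from_report(text):
    """Map each sequence name to the list of all names given in its row."""
    names = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank lines and '#' comment/header lines.
        if not row or row[0].startswith("#"):
            continue
        # Assumed columns: 0 = sequence name, 4 = GenBank accession,
        # 6 = RefSeq accession, 9 = UCSC-style name.
        candidates = (row[0], row[4], row[6], row[9])
        names[row[0]] = [n for n in candidates if n and n != "na"]
    return names


report_names = names_from_report(SAMPLE_REPORT)
print(report_names["1"])
```

A table built this way needs no heuristics such as adding or stripping a `chr` prefix, because every alternate name comes straight from the release.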

diekhans · 2016-04-28T18:42:06Z

While we all wish genomics had not ended up with multiple names
for the same reference sequence, it is the reality. It is
enough of a reality that NCBI now makes the naming part of the
assembly release:

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.28.assembly.txt

Chromosome 1 has both the names `1` and `chr1`, as well as two
accessions (CM000663.2, NC_000001.11).

If we don't bow to this reality and allow multiple names for
references, we will end up needing to have two copies of each
ReferenceSet. It will make data interoperability between
datasets hard.

We could have a name translation service in the API; however, the
problem seems restricted enough that just having an array of
names in each Reference record should solve the problem. A
single name and array of aliases would be fine too.

Internal ids are not intended to be known; they must always be
looked up. Given that (`GRCh38`, `1`) and (`GRCh38`, `chr1`)
both name the same reference chromosome, they would map to the
same internal id and object.
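
A minimal Python sketch of that lookup, assuming a hypothetical `ReferenceNameIndex` class and a made-up internal id `ref-0001`; it is not how any existing server stores references.

```python
class ReferenceNameIndex:
    """Hypothetical index from (reference set, name) to an internal id."""

    def __init__(self):
        self._by_name = {}

    def add(self, reference_set, internal_id, names):
        """Register every name of one reference within a reference set."""
        for name in names:
            key = (reference_set, name)
            existing = self._by_name.get(key)
            # A name may only ever point at one reference in a given set.
            if existing is not None and existing != internal_id:
                raise ValueError(
                    f"{name!r} already names {existing} in {reference_set}"
                )
            self._by_name[key] = internal_id

    def lookup(self, reference_set, name):
        """Return the internal id for an exact name match, or None."""
        return self._by_name.get((reference_set, name))


index = ReferenceNameIndex()
index.add("GRCh38", "ref-0001", ["1", "chr1", "CM000663.2", "NC_000001.11"])

print(index.lookup("GRCh38", "1"))
print(index.lookup("GRCh38", "chr1"))
```

Both lookups resolve to the same internal id, so a single ReferenceSet can serve datasets that use either naming convention.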

mbaudis · 2016-04-28T18:55:50Z

@richarddurbin @diekhans
I'm a bit confused here. We don't have to define what's at the backend. We just need to document

queries for the reference assembly representing human chromosome 1 in the human GRCh37 genome build

... and nobody cares about the backend annotation, e.g. if this refers to chr1 or 1? We do not want to build column format databases here.

Problem here is more in the generalisation, giving recommendations for how to deal with every conceivable genome. But there should be people with informed opinions?!

(As a side note, there is no a priori reason for having a separation by chromosome; should be just a base position address prefix).

diekhans · 2016-04-28T19:27:26Z

@mbaudis naming doesn't impose any structure on the back end. One has to be able to come in with an identifier obtained from outside the API and access data.

diekhans · 2016-04-30T01:09:25Z

We need to resolve this issue, as it is handicapping our ability to deploy real-world data. Before putting together a pull request, I would like to get consensus.

A given reference can have multiple names. This could be done by changing `Reference.name` to an array `names` or by adding a second field `aliases`.
All of the reference names in a `ReferenceSet` must still map to one and only one Reference.
All searches by name will match against any of the Reference object's names/aliases.
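
The three rules above can be sketched in Python as follows. This takes the `name` plus `aliases` variant; the `Reference` dataclass, both helper functions, and the internal ids are hypothetical and only illustrate the proposal.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reference:
    """Hypothetical record with one primary name and optional aliases."""
    id: str
    name: str
    aliases: List[str] = field(default_factory=list)

    def all_names(self):
        return [self.name, *self.aliases]


def find_name_conflicts(references):
    """Map each name claimed by more than one Reference to the claiming ids."""
    owners = defaultdict(set)
    for ref in references:
        for name in ref.all_names():
            owners[name].add(ref.id)
    return {name: sorted(ids) for name, ids in owners.items() if len(ids) > 1}


def search_by_name(references, name):
    """Return the one Reference whose name or aliases contain `name`, else None."""
    hits = [ref for ref in references if name in ref.all_names()]
    if len(hits) > 1:
        raise ValueError(f"{name!r} is ambiguous within this reference set")
    return hits[0] if hits else None


reference_set = [
    Reference("ref-0001", "1", ["chr1"]),
    Reference("ref-0002", "2", ["chr2"]),
]

print(find_name_conflicts(reference_set))
print(search_by_name(reference_set, "chr2").id)
```

Running `find_name_conflicts` when a ReferenceSet is loaded would surface a violation of the second rule before any search is served.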

@richarddurbin, will you please explain your above concerns?

mbaudis · 2016-05-02T06:53:56Z

@diekhans I would prefer to see this solved in a universal way, and would prefer to have this done through a general alternativeNames | aliases ... (array, string) attribute, which also could be used in other records. This is something we need e.g. for capturing alternative sample identifiers anyway; and if you change name to array you would have to do this everywhere for consistency.

richarddurbin · 2016-05-02T09:46:08Z

This is complicated. Our data model for names and identifiers on references has drifted over the last couple of years.
I spent a long time sorting out what has happened, and have sent a longer email to a smaller subset of people rather
than the whole list, to try to reduce the confusion that I fear will follow.

Richard

diekhans mentioned this issue Dec 16, 2015

RNA api schema #517

Closed

diekhans added the Documentation label Feb 4, 2016

diekhans mentioned this issue Apr 28, 2016

API specification allows non-unique reference names in a reference set #617

Closed

macieksmuga mentioned this issue Apr 28, 2016

Document Python Repo API ga4gh/ga4gh-server#1185

Open

diekhans added enhancement and removed Documentation labels Apr 30, 2016
