This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

reference and referencesets should allow for multiple names. #518

Open
diekhans opened this issue Dec 16, 2015 · 23 comments

Comments

@diekhans
Contributor

Sequence data often has multiple equivalent names. It would be very useful for GA4GH to represent and query on multiple different names.

For instance, GRCh38.p2 is also GCF_000001405.28.
Chromosome 1 has the names 1, CM000663.2, NC_000001.11, and chr1.

@david4096
Member

david4096 commented Jan 5, 2016

There are a few different ways to approach this in my mind, but first let me outline the current state of reference and referencesets.

References have a required name and an optional array of sourceAccessions. Reference Sets have an optional name, an optional assemblyId, and an array of sourceAccessions. Both search endpoints accept an accession to search on, which is matched as a case-sensitive exact string.

The SearchReferencesRequest document allows for searching using accession but not name. If searching references by name is a desired use case, let's make sure that makes it in there.

Can searching within sourceAccessions serve this feature?

If not, we might make the name field an array so that a single record can be found using more than one search query. (q=chr1 or q=1) This would leave the request and response records more or less intact.

Alternatively, we might benefit from using ExternalIdentifiers throughout the system. This is the view of a software person, not a biologist, though! Even if sourceAccessions could serve the purpose for references and reference sets, the issue of multiple names arises for naming variants across multiple databases, for example. This would allow the same search semantics to exist across all the record types that may need to be identified in multiple ways.

record ExternalIdentifier {
  /**
  The source of the identifier.
  (e.g. `Ensembl`)
  */
  string database;

  /**
  The ID defined by the external database.
  (e.g. `ENST00000000000`)
  */
  string identifier;

  /**
  The version of the object or the database
  (e.g. `78`)
  */
  string version;
}

@macieksmuga
Contributor

ExternalIdentifier is defined in sequenceAnnotations.avdl as part of the RNA seq PR. #517

@diekhans
Contributor Author

diekhans commented Jan 5, 2016

ExternalIdentifier is in common.avdl in master. I am a little unsure about the definition, since it's never been used.

@mbaudis
Member

mbaudis commented Jan 5, 2016

My take on ExternalIdentifier:

  • yes, should be defined as consistent object; and the attributes look o.k. (though I would add an optional URI)
  • not sure about the "External" part in the name; your external identifier is my local one ... More like a fully qualified identifier or whatever it could be called etc.
  • this is one of those "helper" objects which need a common home, like OntologyTerm, GeographicLocation; moving those to common.avdl or something similar but new would make sense (spattering such objects into both common.avdl and metadata.avdl does not)

@saupchurch
Contributor

Turns out ExternalIdentifier was defined in both sequenceAnnotations.avdl and common.avdl in the rna branch. I just pushed an update removing it from sequenceAnnotations.avdl.

@macieksmuga
Contributor

Thanks Sean. I'm a bit disturbed that the Avro schema checker run during Travis CI (that helped us find a couple of other genuine schema errors) didn't pick this duplication up.

@pgrosu
Contributor

pgrosu commented Jan 6, 2016

Maciek, I don't believe PR #517 is merged yet for the check to catch this, as only the common.avdl schema is merged.

@diekhans
Contributor Author

diekhans commented Jan 6, 2016

@macieksmuga the two definitions exist in different namespaces, so they do not collide.

@pgrosu
Contributor

pgrosu commented Jan 6, 2016

Hmm...maybe I'm doing something funny, but when I check the two files, their namespaces seem to be the same:


$ head -n1 common.avdl
@namespace("org.ga4gh.models")

$ head -n1 sequenceAnnotations.avdl
@namespace("org.ga4gh.models")

$ echo `head -n1 common.avdl` `head -n1 sequenceAnnotations.avdl` | awk '{ print ($1 == $2) ? "true" : "false" }'
true

Which script and parameters are performing the checking?

~p
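
For anyone wanting to check for this kind of clash by hand, here is a minimal sketch, assuming a crude regex scan is good enough. It is not the checker run in Travis CI and it does not parse Avro IDL properly. The function name `find_duplicates`, the protocol names and the trimmed file contents are made up for illustration; only the namespace string and the `ExternalIdentifier` record name come from this thread.

```python
import re
from collections import defaultdict

# Crude patterns: a real check would use an Avro IDL parser instead.
NAMESPACE_RE = re.compile(r'@namespace\("([^"]+)"\)')
TYPE_RE = re.compile(r'^\s*(?:record|enum|error|fixed)\s+(\w+)', re.MULTILINE)


def find_duplicates(sources):
    """Map (namespace, type name) to the files defining it, for clashes only.

    `sources` is {filename: IDL text}. Hypothetical helper for illustration.
    """
    seen = defaultdict(list)
    for filename, text in sources.items():
        match = NAMESPACE_RE.search(text)
        namespace = match.group(1) if match else ""
        for name in TYPE_RE.findall(text):
            seen[(namespace, name)].append(filename)
    return {key: files for key, files in seen.items() if len(files) > 1}


# Made-up, heavily trimmed stand-ins for the two schema files.
SOURCES = {
    "common.avdl": (
        '@namespace("org.ga4gh.models")\n'
        "protocol Common {\n"
        "  record ExternalIdentifier {\n"
        "    string database;\n"
        "  }\n"
        "}\n"
    ),
    "sequenceAnnotations.avdl": (
        '@namespace("org.ga4gh.models")\n'
        "protocol SequenceAnnotations {\n"
        "  record ExternalIdentifier {\n"
        "    string database;\n"
        "  }\n"
        "}\n"
    ),
}

duplicates = find_duplicates(SOURCES)
print(duplicates)
```

With the stand-in files above, the only clash reported is `ExternalIdentifier` in `org.ga4gh.models`, defined in both files.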

@diekhans
Contributor Author

diekhans commented Jan 6, 2016

ouch, they are in the same name space... another good reason to ditch avro

@david4096
Member

Does the optional array of sourceAccessions fulfill this, or should we propose a schema change to add an ExternalIdentifier array to references?

@mbaudis
Member

mbaudis commented Apr 26, 2016

@david4096 Just a warning that "accession" is a loaded base term, which should be defined first (I would go for properly prefixed internal or external id and/or URI). We started this before, but it slipped through the "don't define ahead without a use case" gap. Maybe there is one here.

@diekhans
Contributor Author

diekhans commented Apr 28, 2016

X vs chrX are not accessions, they are alternate names. A search by name needs to match either of these.

@macieksmuga
Contributor

@diekhans I'm trying to nail down this use case: Does a search by name such as "1" need to return a reference named "chr1", and vice versa? Is this (along with "chr2", ... "chrX", "chrY") the ONLY heuristic to build into a name search, or are there others that we should be aware of?

@macieksmuga
Contributor

Also, are searches by accessions for individual references still a valid use case? If so, I'm not sure where I'd find these accessions in the data: Given a single multi-record FASTA file of GRCh37, here's a typical header for one of the chromosomes:

>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1

What part of that, if any, constitutes an accession?

@richarddurbin
Contributor

No. A search by name “1” should only return the reference with name “1”. There should be at most one such reference in a referenceSet.
To get something with name “chr1” you need to search for “chr1”.

It is fine to start from the position that we have now that people know for a given dataset whether it uses references with the “1” or “chr1”
convention. If one doesn’t know it is easy to try both. The thing that no user would know is what the internal id is for chromosome 1 - there
clearly has to be a way to find that.

Richard

@diekhans
Contributor Author

There is no need for heuristic name mapping, which can be unpredictable. The names are precisely defined as part of the assembly release:
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.28.assembly.txt

the accession is not contained in the string:

3 dna:chromosome chromosome:GRCh37:3:1:198022430:1

@diekhans
Contributor Author

While we all wish genomics had not ended up with multiple names
for the same reference sequence, it is the reality. It is
enough of a reality that NCBI now makes the naming part of the
assembly release:

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.28.assembly.txt

Chromosome 1 has both the names "1" and "chr1", as well as two
accessions (CM000663.2, NC_000001.11).
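
A rough sketch of how those equivalent names might be collected from an assembly report follows. It assumes the report is tab-separated, that its last `#` line is a header naming the columns, and that missing values are written as `na`. The column names, the `read_aliases` helper and the four-column excerpt are assumptions made for illustration (the real file has more columns); only the four names for chromosome 1 come from this thread.

```python
# Trimmed, hypothetical excerpt: column names and layout are assumptions,
# the four values are the chromosome 1 names quoted in this thread.
SAMPLE_REPORT = (
    "# Sequence-Name\tGenBank-Accn\tRefSeq-Accn\tUCSC-style-name\n"
    "1\tCM000663.2\tNC_000001.11\tchr1\n"
)

# Columns assumed to carry a name or accession for the sequence.
NAME_COLUMNS = ("Sequence-Name", "GenBank-Accn", "RefSeq-Accn", "UCSC-style-name")


def read_aliases(report_text):
    """Return {sequence name: [all equivalent names]} for a report excerpt."""
    header = None
    aliases = {}
    for line in report_text.splitlines():
        if line.startswith("#"):
            # The last comment line before the data is taken to be the header.
            header = line.lstrip("# ").split("\t")
            continue
        if not line.strip() or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        names = []
        for column in NAME_COLUMNS:
            value = row.get(column, "na")
            # Skip the assumed "na" placeholder and avoid repeating a name.
            if value != "na" and value not in names:
                names.append(value)
        aliases[row["Sequence-Name"]] = names
    return aliases


aliases = read_aliases(SAMPLE_REPORT)
print(aliases)
```

No heuristic such as stripping a "chr" prefix is involved: every name is read from the release file as given.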

If we don't bow to this reality and allow multiple names for
references, we will end up needing to have two copies of each
ReferenceSet. This will make data interoperability between
datasets hard.

We could have a name translation service in the API, however the
problem seems restricted enough that just having an array of
names in each Reference record should solve the problem. A
single name and array of aliases would be fine too.

Internal ids are not intended to be known; they must always be
looked up. Given that ("GRCh38", "1") and ("GRCh38", "chr1")
both name the same reference chromosome, they would map to the
same internal id and object.

@mbaudis
Member

mbaudis commented Apr 28, 2016

@richarddurbin @diekhans
I'm a bit confused here. We don't have to define what's at the backend. We just need to document

  • queries for the reference assembly representing human chromosome 1 in the human GRCh37 genome build

... and nobody cares about the backend annotation, e.g. if this refers to chr1 or 1? We do not want to build column format databases here.

Problem here is more in the generalisation, giving recommendations for how to deal with every conceivable genome. But there should be people with informed opinions?!

(As a side note, there is no a priori reason for having a separation by chromosome; should be just a base position address prefix).

@diekhans
Contributor Author

@mbaudis naming doesn't impose any structure on the back end. One has to be able to come in with an identifier obtained from outside the API and access data.

@diekhans
Contributor Author

We need to resolve this issue, as it is handicapping our ability to deploy real-world data. Before putting together a pull request, I would like to get consensus.

  • A given reference can have multiple names. This could be done by changing Reference.name to an array names or by adding a second field aliases.
  • All of the reference names in a ReferenceSet must still map to one and only one Reference.
  • All searches by name will match against any of the Reference object's names/aliases.

@richarddurbin, will you please explain your above concerns?
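
To make the three points above concrete, here is a minimal in-memory sketch. It is not the GA4GH schema or server code: the `ReferenceSet` class, its method names and the internal ids such as `ref-0001` are hypothetical, and matching is assumed to stay case-sensitive and exact, as it is for accession searches today.

```python
class ReferenceSet:
    """Toy reference set; a hypothetical model, not the GA4GH record."""

    def __init__(self, name):
        self.name = name
        self._by_name = {}  # any name or alias -> internal reference id

    def add_reference(self, internal_id, names):
        # Every name in the set must map to one and only one Reference.
        for name in names:
            owner = self._by_name.get(name)
            if owner is not None and owner != internal_id:
                raise ValueError(f"name {name!r} already maps to {owner!r}")
        for name in names:
            self._by_name[name] = internal_id

    def search_by_name(self, name):
        # Case-sensitive exact match against any of the names; no heuristics.
        return self._by_name.get(name)


grch38 = ReferenceSet("GRCh38")
grch38.add_reference("ref-0001", ["1", "chr1"])
grch38.add_reference("ref-0023", ["X", "chrX"])

print(grch38.search_by_name("1"), grch38.search_by_name("chr1"))
```

Both lookups print the same internal id, a name that was never registered (for example `Chr1`) finds nothing, and registering `chr1` against a second Reference raises an error instead of silently creating an ambiguity.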

@mbaudis
Member

mbaudis commented May 2, 2016

@diekhans I would prefer to see this solved in a universal way, and would prefer to have this done through a general alternativeNames | aliases ... (array, string) attribute, which also could be used in other records. This is something we need e.g. for capturing alternative sample identifiers anyway; and if you change name to array you would have to do this everywhere for consistency.

@richarddurbin
Contributor

This is complicated. Our data model for names and identifiers on references has drifted over the last couple of years.
I spent a long time sorting out what has happened, and have sent a longer email to a smaller subset of people rather
than the whole list, to try to reduce the confusion that I fear will follow.

Richard
