reference and referencesets should allow for multiple names. #518
Comments
There are a few different ways to approach this in my mind, but first let me outline the current state of References have a required The Can searching within If not, we might make the Alternatively, we might benefit from using
ExternalIdentifier is in common.avdl in master. I am a little unsure about the definition, since it's never been used. |
My take on
Turns out ExternalIdentifier was defined in both sequenceAnnotations.avdl and common.avdl in the rna branch. I just pushed an update removing it from sequenceAnnotations.avdl. |
Thanks Sean. I'm a bit disturbed that the Avro schema checker run during Travis CI (that helped us find a couple of other genuine schema errors) didn't pick this duplication up. |
Maciek, I don't believe PR #517 is merged yet for the check to catch this, as only the common.avdl schema is merged. |
@macieksmuga the two definitions exist in different namespaces, so they don't collide.
Hmm...maybe I'm doing something funny, but when I check the two files, their namespaces seem to be the same:
Which script and parameters are performing the checking? ~p |
ouch, they are in the same name space... another good reason to ditch avro |
Does the optional array of sourceAccessions fulfill this, or should we propose a schema change to add an ExternalIdentifier array to references? |
@david4096 Just a warning that "accession" is a loaded base term, which should be defined first (I would go for properly prefixed internal or external id and/or URI). We started this before, but it slipped through the "don't define ahead without a use case" gap. Maybe there is one here. |
X vs chrX are not accessions, they are alternate names. A search by name needs to match either of these. |
@diekhans I'm trying to nail down this use case: Does a search by name such as "1" need to return a reference named "chr1", and vice versa? Is this (along with "chr2", ... "chrX", "chrY") the ONLY heuristic to build into a name search, or are there others that we should be aware of? |
Also, are searches by accessions for individual references still a valid use case? If so, I'm not sure where I'd find these accessions in the data: Given a single multi-record FASTA file of GRCh37, here's a typical header for one of the chromosomes:
What part of that, if any, constitutes an accession? |
No. A search by name “1” should only return the reference with name “1”. There should be at most one such reference in a referenceSet. It is fine to start from the position that we have now that people know for a given dataset whether it uses references with the “1” or “chr1” Richard
The Wellcome Trust Sanger Institute is operated by Genome Research |
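The exact-match behaviour described above (a search for "1" returns only the reference named "1", with at most one such reference per referenceSet) can be sketched as follows. This is a minimal illustration only: the `Reference` class, its `id` and `name` fields, and `find_by_name` are hypothetical names, not part of the GA4GH schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    id: str    # server-internal id (hypothetical field)
    name: str  # name exactly as given by the data provider


def find_by_name(references, name):
    """Return the single reference whose name equals `name`, or None.

    No heuristic mapping is applied: "1" does not match "chr1".
    """
    matches = [r for r in references if r.name == name]
    if len(matches) > 1:
        # A referenceSet should hold at most one reference per name.
        raise ValueError(f"duplicate reference name in set: {name!r}")
    return matches[0] if matches else None


# Hypothetical referenceSet that uses the un-prefixed naming style.
reference_set = [Reference("ref-0", "1"), Reference("ref-1", "X")]
hit = find_by_name(reference_set, "1")       # the reference named "1"
miss = find_by_name(reference_set, "chr1")   # None: no implicit "chr" handling
```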
there is no need for heuristic name mapping, which can be unpredictable. The names are precisely defined as part of the assembly release: the accession is not contained in the string:
While we all wish genomics had not ended up with multiple names ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.28.assembly.txt Chromosome 1 has both the names If we don't bow to this reality and allow multiple names for We could have a name translation service in the API, however the Internal ids are not intended to be known; they must always be
@richarddurbin @diekhans
... and nobody cares about the backend annotation, e.g. if this refers to Problem here is more in the generalisation, giving recommendations for how to deal with every conceivable genome. But there should be people with informed opinions?! (As a side note, there is no a priori reason for having a separation by chromosome; should be just a base position address prefix). |
@mbaudis naming doesn't impose any structure on the back end. One has to be able to come in with an identifier obtained from outside the API and access data. |
We need to resolve this issue, as it is handicapping our ability to deploy real-world data. Before putting together a pull request, I would like to get consensus.
@richarddurbin, will you please explain your above concerns? |
@diekhans I would prefer to see this solved in a universal way, and would prefer to have this done through a general |
This is complicated. Our data model for names and identifiers on references has drifted over the last couple of years. Richard
Sequence data often has multiple equivalent names. It would be very useful for GA4GH to represent and query on multiple different names.
for instance GRCh38.p2 is also GCF_000001405.28
Chromosome 1 has the names 1, CM000663.2, NC_000001.11, and chr1
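As a rough sketch of what querying on multiple equivalent names could look like, the snippet below indexes one reference under every name listed above. The `NamedReference` class, its fields, and `build_name_index` are hypothetical and only illustrate the idea; they are not taken from the GA4GH schemas.

```python
from dataclasses import dataclass


@dataclass
class NamedReference:
    id: str      # server-internal id (hypothetical field)
    names: list  # all equivalent names for this sequence


def build_name_index(references):
    """Map every known name to its reference, rejecting ambiguous names."""
    index = {}
    for ref in references:
        for name in ref.names:
            if name in index and index[name] is not ref:
                raise ValueError(f"name {name!r} maps to two references")
            index[name] = ref
    return index


# Names for chromosome 1 as listed in this issue.
chr1 = NamedReference("ref-chr1", ["1", "CM000663.2", "NC_000001.11", "chr1"])
index = build_name_index([chr1])

# Any of the equivalent names resolves to the same reference object.
same = index["chr1"] is index["NC_000001.11"]
```

The same approach would apply to reference sets, for example indexing one set under both GRCh38.p2 and GCF_000001405.28.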