-
Notifications
You must be signed in to change notification settings - Fork 110
How do repeated info tags in variants fit with GAKeyValue? #135
Comments
@cassiedoll, These are very helpful, especially in later analysis. They will probably have their own Avdl file at some point in time for the sake of clarity, so that eventually they might become searchable criteria (i.e. tables, trees, etc.). |
There's also: d) promote these to first-class attributes on the I don't particularly like (e), but it may be the only way to support VCF without requiring clients to parse values encoded in strings. It would be nice if something like (f) would work, but there are other fields like |
we'll definitely do (d) as we go along, but i'm not sure we can get it done-enough to make it into v0.5.1 - which means we need some plan for implementation. (e) makes me the saddest ever. |
Why not make GAKeyValue something that can be either an array or a single value? |
That would be a vote for option (b). |
This is how vcflib internally handles this. It's a pain but it's better than many alternatives. On Mon, Sep 8, 2014 at 1:42 PM, cassiedoll notifications@github.com wrote:
|
@cassiedoll, that's fine for now, but will you search for them every time? How will you perform analysis across many |
I agree with @cassiedoll here: I absolutely vote for every (!?) value being an array. Avoids lots of pain & sneaky workarounds. Arrays of length=1 can be requested by the schema, as can be restricted lists. |
o.k.: |
Okay so changing KeyValue to:
is the current consensus I think? I can make a PR for this unless someone else has a better plan. |
Is there no mechanism to allow for any structure? It would be incredibly |
I'm sure there is a mechanism, but I don't agree that this is desired. As I said on the mailing list, I'd prefer to restrict the key/values to strings simply so that using them for formal types isn't encouraged. |
What problems could result? Would you please give some examples? |
That implies losing strong typing, and strong typing has its advantages. |
I agree that strong typing is important. Adding first class fields with well defined types, good comments and docs, etc is very important. That's probably the most important thing we are doing right now :) Putting that information into a generic keyvalue bucket however, can lead to fragmentation. If we allowed a keyvalue to map to any structured object, why wouldn't an entire variant just look like this?
absolutely everything could be represented with that object, but it wouldn't make for a very good API. Users wouldn't know how a field like genotype was represented. The compliance tests would be almost useless as nothing would be validated. Representation of something like genotype could actually split in the community, and so writing common code would be very difficult. And so forth. I think that's the main reason why I object to supporting anything more than strings (or possibly arrays of strings) in the generic KeyValue. It's good to allow an outlet for extensions to the API. It is not good to allow extensions to take the role of real first class fields. |
Hi Cassie, It might be easier to resolve this as @ekg recommended. Please take the first example from 1000 Genomes from the link below, and show us the actual structure in how you would like to format and store it using data from the VCF file under Example 0: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 I think it help with clearing up any confusion and drive the discussion on the pros/cons of the implementation. Thank you, |
As @lh3 noted on the mailing list, losing the typing information for the On Mon, Sep 8, 2014 at 5:40 PM, Paul Grosu notifications@github.com wrote:
|
One possible solution is to add a "type" field to GAKeyValue but still keep optional tags as strings:
|
If the GAVariant is equivalent to a VCF record, there will also need to be a flag describing if this value is attached to the allele or the variant itself. A cleaner solution might be to move allele-specific annotations into an allele object., which is currently represented as an array of strings https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl#L161. An allele object could allow the attachment of annotations to both the reference and alternates. |
@cassiedoll makes a great point about fragmentation - if we rely on the It seems to me that the discussion is a bit abstract at the moment, because I'm not quite sure exactly what info fields are going to be promoted to first class attributes of the |
@jeromekelleher In my experience, the usefulness of INFO fields varies greatly with application. We see programs adding new tags or removing existing ones all the time (for good reasons). High on such a list should probably be functional consequences of variants. Also allele frequency in various populations. |
Really what we want is the type to be specified for the key, not in each KeyValue pair. There are two issues created by this: So far GA4GH has ignored this, or at least shelved it, but this typed data extension is an important non-trivial feature of BAM and VCF. I am with Cassie in agreeing to move as much as possible into standard slots, but experience is that there are always new keys being defined and used in a non-centralised fashion by a wide variety of users/systems/software. We could view this as allowing inheritance from the core globally defined GA4GH data types. The level 0 option is then for people to store a serialised of the original material in a blob. I think we are going to need to come back to the discussion of scope - this is related to the earlier discussion about Study or Project, which has more to do with data access than key definitions. Richard On 8 Sep 2014, at 23:26, Heng Li notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
I am also in favor of introducing more fixed fields and reducing optional tags. I have enough troubles with extracting read depth from various variant callers. However, typed tagging is an indispensable part of VCF (not so much of SAM). VCF tags are harder to be standardized. Filtering variants by a tag in the INFO field is a necessary and fairly frequent operation. GATK, bcftools and vcflib all provide ways to filter by arbitrary tags. It would be a step backward if GA4GH does not support this. I realize that type alone is not adequate, either. We also need description of tags and perhaps as Erik said, the "A" and "G" Number types. I think we should closely mirror the VCF header in GACallSet, defining the type, size and description of each tag. We still keep values as strings, just like the VCF way. Programs need to convert string values to appropriate typed values when analyzing the fields. |
Before commenting, it would be great to be added to the mailing list, in order to more precisely contribute to the whole discussion. Currently I just use the Github schemas and comments as a reference, which I feel might be half of the discussion. My email is pgrosu@gmail.com Looking back at the VCF specs, these are well-defined, fixed fields under section 1.4.1 and more specifically 1.4.1.8 - below are the links to spec v4.1 and v4.2: http://samtools.github.io/hts-specs/VCFv4.1.pdf http://samtools.github.io/hts-specs/VCFv4.2.pdf We're just using these for now, since it's what we have. Later on, this part will probably evolve to more dynamically generated fields, which most likely will be a few required ones. With v0.5 I think we still have time to explore. I like @lh3 approach of a generic via type definitions, but it would be easier to just perform checks on something like Thank you, |
In my opinion, we should use both approaches (as mentioned by @lh3 @richarddurbin @cassiedoll and others): Fixed fields for "commonly used" fields (e.g. AF, DP, variant functional annotations, etc.), and leave the others in a generic GAKeyValue. GAKeyValue should be tag-typed and contain VCF-style number information ( per allele, per genotype, fixed number of values, etc.). The default scoping for these generic tags should follow programming common sense: i) local file scope, ii) file-set scope, iii) project scope, iv) global scope. This leaves enough room for flexibility (and for shooting yourself in the foot). Which INFO fields are used commonly enough to be promoted to "fixed fields"? (we are taking care of some of the functional annotation ones in PR #126 ). Obviously the ones mentioned in VCF 4.2 is a good start. Some comments to reduce field pollution: Set membership fields (H2, H3, 1000G, etc.) can be added as a single "membership" field. DB is already contemplated in GAVariant.id field. DP, CIGAR, BQ, and MQ* probably belong to GACall. So the list is reduced to: INFO fields to "promote" to fixed fields (some may be methods): Already in GAVariant: Collapse these into a single 'membership' set: Belong to GACall: This comment is already long enough, so I'll leave the genotype-based fields for later. |
Another related topic: As far as I understand, the VCF spec. does not explicitly forbid two INFO fields having the same key. Please, correct me if I'm wrong, but this means that "AF=0.1;AF=0.2" is a valid VCF INFO field (although I'm pretty sure that was not the intention of the spec.). Personally, I think that key-value pairs should be limited to have only one entry per key (and perhaps the VCF spec. should be amended in the same direction?). |
Duplicated keys may not be explicitly prohibited, but it will break every The reason for multiple fields is that annotations can (and often should) |
In ADAM, we use "unary" genotypes, which enforce biallelism. We then have a separate schema for allele-specific annotations, and join these on the variant key. |
@ekg, like you said, it should not be done, but here's how you can use maps and store multiple values for the same Info field - using a combination of @lh3's and @cassiedoll's proposed structures:
Alternatively you can do this, which is a little cleaner:
|
@fnothaft, can you point us to the ADAM Avro IDL schema that highlights this fix? I could only find the source code on github. I think it could help us out here. Thanks, |
That's a really useful list @pcingola, thanks! I like the idea of the Duplicated keys would be really unpleasant; I would be in favour of pretty much any alternative to this. |
I am in favour of banning duplicated keys, including in VCF. Also order should not matter. This is truly a map. Sent from my iPhone
The Wellcome Trust Sanger Institute is operated by Genome Research |
+1 for a set of commonly used fields One thing that I have been meaning to bring up, and I do this with great hesitation, is whether we should introduce formal Ontology concepts in addition to (or in place of) GAKeyValue. record GAOntology {
string name;
string prefix;
string URI;
}
record GAOntologyTerm {
string name;
string accession;
string value;
string source_ontology_id;
}
|
@pgrosu Right, this is roughly how things work now in VCF. How can you indicate that a particular field is referring to the entire variant record (e.g. |
@jeromekelleher You're asking What sort of things are not reserved by the VCF spec but are commonly used 'in the wild'. It very depends on the application. Is the API meant for all stages of data processing including, say, variant calling but also functional analysis? |
@pd3 There is a PR for functional annotations, so that part is begin standardized (may be you have some comments to improve that PR). I guess the questions boils down to: Which tags are important enough to be promoted to fields? |
This is a very good question @pd3, and I don't really know. There's a more specific question for me: do we imagine these @richarddurbin: "If I want to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map, what would have to change?" @cassiedoll: "Let's additionally pretend that I had some new info tag I was messing around with. I should be able to run some analysis on my Variants, come up with my snazzy info tag, and store it back into the API." Is this really a good model though? Wouldn't it be better to have a read/write 'table' where a user can add annotations (not to be confused with the functional annotations of #126!) against a given object ID, rather than update the object itself? If we take this approach, then perhaps there isn't as much pressure to make the
This might also help from a reproducability perspective, since we can add as many (time stamped) annotations into this table as we like without interfering with the primary data. |
@ekg, Sorry to reply so late. I got caught up in another discussion. So let's approach this through, encapsulation which is similar to what we have now. Inheritance would be neater if it was possible in Avro/IDL, but the code would become more cumbersome. So below is a general class structure how I generally see this and where you have granularity in updating the INFO fields:
Based on this, it would require minimal changes to the schema, but some changes would be necessary. Again, I should reiterate that we are still looking at the API through the lens of simple, clean designs but I'm not so sure they will scale. I agree that it's good to start somewhere, but at some point we should start thinking in petascale, parallel data-structures which opens up other analysis options we will require to perform later on. Just like @richarddurbin suggested in #142 by wanting "to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map" we can have such flexibility. The idea of approaches like Google Pregel in conjunction with Google Mesa would allow the opportunity to update data on the fly, and have the updates be propagated as a new version across all the studies for side-by-side comparison. That would open up a new world for large-scale data sharing, processing and analysis. |
Hi Benedict, The slides look great and fully agree this is essential. Not sure I can make it to ASHG, but would #143 be part of this discussion? Thanks, |
@benedictpaten I think those slides belong on #142 instead? |
Oops, yes. Hold on.. need coffee or something. |
Is this ready to be closed? |
The last comment on this was Oct 2014 I'm closing this in 2 days. |
Closing due to inactivity. This can be reopened if necessary. |
Filename in convert_to_binary.sh is now correct.
There are some variants and call info fields which are repeated in VCF. (e.g. any tag that is repeated per-allele - AC, AF, HQ, EC etc)
How should these fields be represented as GAKeyValue - which has a single string value for each key?
Should we:
a) smash values together in some well known way :(
b) make
GAKeyValue::value
an array :(c) some better idea I'm missing?
The text was updated successfully, but these errors were encountered: