Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

How do repeated info tags in variants fit with GAKeyValue? #135

Closed
cassiedoll opened this issue Sep 3, 2014 · 45 comments
Closed

How do repeated info tags in variants fit with GAKeyValue? #135

cassiedoll opened this issue Sep 3, 2014 · 45 comments

Comments

@cassiedoll
Copy link
Member

There are some variants and call info fields which are repeated in VCF. (e.g. any tag that is repeated per-allele - AC, AF, HQ, EC etc)

How should these fields be represented as GAKeyValue - which has a single string value for each key?

Should we:
a) smash values together in some well known way :(
b) make GAKeyValue::value an array :(
c) some better idea I'm missing?

@pgrosu
Copy link
Contributor

pgrosu commented Sep 4, 2014

@cassiedoll, These are very helpful, especially in later analysis. They will probably have their own Avdl file at some point in time for the sake of clarity, so that eventually they might become searchable criteria (i.e. tables, trees, etc.).

@jeromekelleher
Copy link
Contributor

There's also:

d) promote these to first-class attributes on the Variant or Call types;
e) make KeyValue.value a record, which encodes the type and cardinality of the data it encapsulates;
f) make a separate alleleInfo field in the Variant type in which keys map to arrays.

I don't particularly like (e), but it may be the only way to support VCF without requiring clients to parse values encoded in strings. It would be nice if something like (f) would work, but there are other fields like INFO.CIEND which are arrays but have nothing to do with alleles.

@cassiedoll
Copy link
Member Author

we'll definitely do (d) as we go along, but i'm not sure we can get it done-enough to make it into v0.5.1 - which means we need some plan for implementation.

(e) makes me the saddest ever.
(f) is a definite possibility...

@ekg
Copy link

ekg commented Sep 8, 2014

Why not make GAKeyValue something that can be either an array or a single value?

@cassiedoll
Copy link
Member Author

That would be a vote for option (b).
(It's simpler to have the value always be an array than it is to explicitly allow both types)

@ekg
Copy link

ekg commented Sep 8, 2014

This is how vcflib internally handles this.

It's a pain but it's better than many alternatives.

On Mon, Sep 8, 2014 at 1:42 PM, cassiedoll notifications@github.com wrote:

That would be a vote for option (b).
(It's simpler to have the value always be an array than it is to
explicitly allow both types)


Reply to this email directly or view it on GitHub
#135 (comment).

@pgrosu
Copy link
Contributor

pgrosu commented Sep 8, 2014

@cassiedoll, that's fine for now, but will you search for them every time? How will you perform analysis across many GAVariantSets? See my comment for alternatives.

@mbaudis
Copy link
Member

mbaudis commented Sep 8, 2014

I agree with @cassiedoll here: I absolutely vote for every (!?) value being an array. Avoids lots of pain & sneaky workarounds.

Arrays of length=1 can be requested by the schema, as can be restricted lists.

@mbaudis
Copy link
Member

mbaudis commented Sep 8, 2014

o.k.:
g) lists of (e) should accommodate for everything ;-)

@cassiedoll
Copy link
Member Author

Okay so changing KeyValue to:

record GAKeyValue {
  string key;
  array<string> value; 
}

is the current consensus I think?

I can make a PR for this unless someone else has a better plan.

@ekg
Copy link

ekg commented Sep 8, 2014

Is there no mechanism to allow for any structure? It would be incredibly
useful if this could contain lists, dictionaries, and atomic types. As
noted this is a huge limitation with VCF.

@cassiedoll
Copy link
Member Author

I'm sure there is a mechanism, but I don't agree that this is desired. As I said on the mailing list, I'd prefer to restrict the key/values to strings simply so that using them for formal types isn't encouraged.
Keys that need better typing should become first class fields on the variant and call objects.

@ekg
Copy link

ekg commented Sep 8, 2014

What problems could result? Would you please give some examples?

@sdumitriu
Copy link
Member

That implies losing strong typing, and strong typing has its advantages.

@cassiedoll
Copy link
Member Author

I agree that strong typing is important. Adding first class fields with well defined types, good comments and docs, etc is very important. That's probably the most important thing we are doing right now :)

Putting that information into a generic keyvalue bucket however, can lead to fragmentation. If we allowed a keyvalue to map to any structured object, why wouldn't an entire variant just look like this?

record GAVariant {
  repeated GAKeyValue fields;
}

absolutely everything could be represented with that object, but it wouldn't make for a very good API. Users wouldn't know how a field like genotype was represented. The compliance tests would be almost useless as nothing would be validated. Representation of something like genotype could actually split in the community, and so writing common code would be very difficult. And so forth.

I think that's the main reason why I object to supporting anything more than strings (or possibly arrays of strings) in the generic KeyValue. It's good to allow an outlet for extensions to the API. It is not good to allow extensions to take the role of real first class fields.

@pgrosu
Copy link
Contributor

pgrosu commented Sep 8, 2014

Hi Cassie,

It might be easier to resolve this as @ekg recommended. Please take the first example from 1000 Genomes from the link below, and show us the actual structure in how you would like to format and store it using data from the VCF file under Example 0:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

I think it help with clearing up any confusion and drive the discussion on the pros/cons of the implementation.

Thank you,
Paul

@ekg
Copy link

ekg commented Sep 8, 2014

As @lh3 noted on the mailing list, losing the typing information for the
INFO field items will make it very difficult to interchange to VCF. Perhaps
this is a slightly different problem, in that it could be resolved by
another specification of the type of the value. Every value could have an
associated type.

On Mon, Sep 8, 2014 at 5:40 PM, Paul Grosu notifications@github.com wrote:

Hi Cassie,

It might be easier to resolve this as @ekg https://github.com/ekg
recommended. Please take the first example from 1000 Genomes from the link
below, and show us the actual structure in how you would like to format and
store it using data from the VCF file under Example 0:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

I think it help with clearing up any confusion and drive the discussion on
the pros/cons of the implementation.

Thank you,
Paul


Reply to this email directly or view it on GitHub
#135 (comment).

@lh3
Copy link
Member

lh3 commented Sep 8, 2014

One possible solution is to add a "type" field to GAKeyValue but still keep optional tags as strings:

enum GAValueType { INT,REAL,STRING }; // probably more
record GAKeyValue {
  string key;
  string value; // or array<string> values; I am neutral.
  GAValueType type;
}

@ekg
Copy link

ekg commented Sep 8, 2014

If the GAVariant is equivalent to a VCF record, there will also need to be a flag describing if this value is attached to the allele or the variant itself.

A cleaner solution might be to move allele-specific annotations into an allele object., which is currently represented as an array of strings https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl#L161.

An allele object could allow the attachment of annotations to both the reference and alternates.

@jeromekelleher
Copy link
Contributor

@cassiedoll makes a great point about fragmentation - if we rely on the info field to hold important values, then there's a whole world of trouble involved. Also, I don't think we could ever support complex queries of the type that @lh3 outlined on the mailing list unless the fields involved were first class attributes. It would be very hard for a backend to do the indexing it needs if it had to support arbitrary queries on an arbitrary bag of info fields.

It seems to me that the discussion is a bit abstract at the moment, because I'm not quite sure exactly what info fields are going to be promoted to first class attributes of the Variant and Call objects. Would anyone like to make a first pass at making a list of these fields? (@lh3 maybe?) We don't need to start implementing these immediately, but it might make it a little more clear what the remaining 'second class' fields would be and how general they would need to be.

@pd3
Copy link

pd3 commented Sep 9, 2014

@jeromekelleher In my experience, the usefulness of INFO fields varies greatly with application. We see programs adding new tags or removing existing ones all the time (for good reasons).

High on such a list should probably be functional consequences of variants. Also allele frequency in various populations.

@richarddurbin
Copy link
Contributor

Really what we want is the type to be specified for the key, not in each KeyValue pair.
There should be a declaration of keys, with the types that they attach to, and if possible further documentation/description. That is what is in the VCF or BAM header, and allows for key-value pairs to be stored and manipulated in binary. Also I think the declaration can, or could, say something about how the types behave, e.g. are they per allele in VCF.
In effect what this INFO key:value construction does is provide for dynamic strongly typed extension of the data structure, which the relevant library code such as htslib supports.

There are two issues created by this:
1) This is complex to implement, and looks like work that really a programming language should be handling. However we want to be language-neutral, and AVRO doesn't support it.
2) It is not clear how to manage scope of these key declarations when we move away from single files. We could view this also something that languages support. In any case we would need a policy.

So far GA4GH has ignored this, or at least shelved it, but this typed data extension is an important non-trivial feature of BAM and VCF. I am with Cassie in agreeing to move as much as possible into standard slots, but experience is that there are always new keys being defined and used in a non-centralised fashion by a wide variety of users/systems/software.

We could view this as allowing inheritance from the core globally defined GA4GH data types. The level 0 option is then for people to store a serialised of the original material in a blob.
The next level up is to store a key:value map with the key being understood, but the value being a string, essentially serialising the information attached to the key. This is where we are now.
Cassie and other propose extending the value to an array of strings, which would deal with quite a big expansion of the type space, at not too high a cost for scalars.
The next level up is to support a more complex type description, which would take us out of AVRO.

I think we are going to need to come back to the discussion of scope - this is related to the earlier discussion about Study or Project, which has more to do with data access than key definitions.

Richard

On 8 Sep 2014, at 23:26, Heng Li notifications@github.com wrote:

One possible solution is to add a "type" field to GAKeyValue but still keep optional tags as strings:

enum GAValueType { INT,REAL,STRING }; // probably more
record GAKeyValue {
string key;
string value; // or array values; I am neutral.
GAValueType type;
}

Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@lh3
Copy link
Member

lh3 commented Sep 9, 2014

I am also in favor of introducing more fixed fields and reducing optional tags. I have enough troubles with extracting read depth from various variant callers. However, typed tagging is an indispensable part of VCF (not so much of SAM). VCF tags are harder to be standardized. Filtering variants by a tag in the INFO field is a necessary and fairly frequent operation. GATK, bcftools and vcflib all provide ways to filter by arbitrary tags. It would be a step backward if GA4GH does not support this.

I realize that type alone is not adequate, either. We also need description of tags and perhaps as Erik said, the "A" and "G" Number types. I think we should closely mirror the VCF header in GACallSet, defining the type, size and description of each tag. We still keep values as strings, just like the VCF way. Programs need to convert string values to appropriate typed values when analyzing the fields.

@pgrosu
Copy link
Contributor

pgrosu commented Sep 9, 2014

Before commenting, it would be great to be added to the mailing list, in order to more precisely contribute to the whole discussion. Currently I just use the Github schemas and comments as a reference, which I feel might be half of the discussion. My email is pgrosu@gmail.com

Looking back at the VCF specs, these are well-defined, fixed fields under section 1.4.1 and more specifically 1.4.1.8 - below are the links to spec v4.1 and v4.2:

http://samtools.github.io/hts-specs/VCFv4.1.pdf

http://samtools.github.io/hts-specs/VCFv4.2.pdf

We're just using these for now, since it's what we have. Later on, this part will probably evolve to more dynamically generated fields, which most likely will be a few required ones. With v0.5 I think we still have time to explore.

I like @lh3 approach of a generic via type definitions, but it would be easier to just perform checks on something like if ((GAVariant::Info::Somatic) && (GAVariant::Info::Validated)) then ...

Thank you,
Paul

@pcingola
Copy link
Contributor

pcingola commented Sep 9, 2014

In my opinion, we should use both approaches (as mentioned by @lh3 @richarddurbin @cassiedoll and others): Fixed fields for "commonly used" fields (e.g. AF, DP, variant functional annotations, etc.), and leave the others in a generic GAKeyValue.

GAKeyValue should be tag-typed and contain VCF-style number information ( per allele, per genotype, fixed number of values, etc.). The default scoping for these generic tags should follow programming common sense: i) local file scope, ii) file-set scope, iii) project scope, iv) global scope. This leaves enough room for flexibility (and for shooting yourself in the foot).

Which INFO fields are used commonly enough to be promoted to "fixed fields"? (we are taking care of some of the functional annotation ones in PR #126 ). Obviously the ones mentioned in VCF 4.2 is a good start. Some comments to reduce field pollution: Set membership fields (H2, H3, 1000G, etc.) can be added as a single "membership" field. DB is already contemplated in GAVariant.id field. DP, CIGAR, BQ, and MQ* probably belong to GACall.

So the list is reduced to:

INFO fields to "promote" to fixed fields (some may be methods):
AA : ancestral allele
AC : allele count in genotypes
AF : allele frequency for each ALT
NS : Number of samples with data
SB : strand bias at this position

Already in GAVariant:
DB : dbSNP membership ('id' field)
END : end position of the variant ('end' field)

Collapse these into a single 'membership' set:
H2 : membership in hapmap2
H3 : membership in hapmap3
1000G : membership in 1000 Genomes
VALIDATED : validated by follow-up experiment
SOMATIC : indicates that the record is a somatic mutation, for cancer genomics

Belong to GACall:
AN : total number of alleles
BQ : RMS base quality at this position
CIGAR : cigar string
DP : combined depth across samples
MQ : RMS mapping quality, e.g. MQ=52
MQ0 : Number of MAPQ == 0 reads covering this record

This comment is already long enough, so I'll leave the genotype-based fields for later.
Are there any other fields that deserve promotion?

@pcingola
Copy link
Contributor

pcingola commented Sep 9, 2014

Another related topic: As far as I understand, the VCF spec. does not explicitly forbid two INFO fields having the same key. Please, correct me if I'm wrong, but this means that "AF=0.1;AF=0.2" is a valid VCF INFO field (although I'm pretty sure that was not the intention of the spec.).

Personally, I think that key-value pairs should be limited to have only one entry per key (and perhaps the VCF spec. should be amended in the same direction?).

@ekg
Copy link

ekg commented Sep 9, 2014

Duplicated keys may not be explicitly prohibited, but it will break every
parser I know of.

The reason for multiple fields is that annotations can (and often should)
be allele-specific. How would you represent these if we limit ourselves to
single key/value pairs?

@fnothaft
Copy link
Contributor

fnothaft commented Sep 9, 2014

The reason for multiple fields is that annotations can (and often should)
be allele-specific. How would you represent these if we limit ourselves to
single key/value pairs?

In ADAM, we use "unary" genotypes, which enforce biallelism. We then have a separate schema for allele-specific annotations, and join these on the variant key.

@pgrosu
Copy link
Contributor

pgrosu commented Sep 9, 2014

@ekg, like you said, it should not be done, but here's how you can use maps and store multiple values for the same Info field - using a combination of @lh3's and @cassiedoll's proposed structures:

map<string> InfoField = array<string>;  // The notation is not-AVRO-happy,
                                        // but you can have this in most 
                                        // programming languages.

Alternatively you can do this, which is a little cleaner:

enum GAInfoField { "AA", "AC", "AF", "NS", "SB", ... }

enum GAInfoDataType { INT, BOOLEAN, REAL, STRING }

record GACallInfo {
  GAInfoType InfoField;
  GAInfoDataType InfoDataType;
  array<string> InfoValue; 
}

@pgrosu
Copy link
Contributor

pgrosu commented Sep 10, 2014

@fnothaft, can you point us to the ADAM Avro IDL schema that highlights this fix? I could only find the source code on github. I think it could help us out here.

Thanks,
Paul

@jeromekelleher
Copy link
Contributor

That's a really useful list @pcingola, thanks! I like the idea of the membership field too; this is a useful simplification and is also extensible. What sort of things are not reserved by the VCF spec but are commonly used 'in the wild'? (@pd3?)

Duplicated keys would be really unpleasant; I would be in favour of pretty much any alternative to this.

@richarddurbin
Copy link
Contributor

I am in favour of banning duplicated keys, including in VCF. Also order should not matter. This is truly a map.

Sent from my iPhone

On 10 Sep 2014, at 09:19, Jerome Kelleher notifications@github.com wrote:

That's a really useful list @pcingola, thanks! I like the idea of the membership field too; this is a useful simplification and is also extensible. What sort of things are not reserved by the VCF spec but are commonly used 'in the wild'? (@pd3?)

Duplicated keys would be really unpleasant; I would be in favour of pretty much any alternative to this.


Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@delagoya
Copy link
Contributor

+1 for a set of commonly used fields
+1 for banning duplicate keys
+1 for keeping a generic GAKeyValue array for "other" category

One thing that I have been meaning to bring up, and I do this with great hesitation, is whether we should introduce formal Ontology concepts in addition to (or in place of) GAKeyValue.

record GAOntology {
  string name;
  string prefix;
  string URI;
} 

record GAOntologyTerm {
  string name;
  string accession;
  string value;
  string source_ontology_id; 
}

@cassiedoll
Copy link
Member Author

@delagoya - btw metadata is adding a GAOntologyTerm just like this in #136

@ekg
Copy link

ekg commented Sep 10, 2014

@pgrosu Right, this is roughly how things work now in VCF. How can you indicate that a particular field is referring to the entire variant record (e.g. DP) or a single allele (e.g. AC)?

@pd3
Copy link

pd3 commented Sep 11, 2014

@jeromekelleher You're asking What sort of things are not reserved by the VCF spec but are commonly used 'in the wild'. It very depends on the application. Is the API meant for all stages of data processing including, say, variant calling but also functional analysis?
At the level of variant calling and filtering, all callers use different metrics to make the calls and estimate call quality, compare for example annotations generated by samtools, freebayes or GATK. Similar situation is in functional annotations generated by, for example, SnpEff or VEP, but here, I believe, the annotations could be standardized more easily.

@pcingola
Copy link
Contributor

@pd3 There is a PR for functional annotations, so that part is begin standardized (may be you have some comments to improve that PR).

I guess the questions boils down to: Which tags are important enough to be promoted to fields?

@jeromekelleher
Copy link
Contributor

This is a very good question @pd3, and I don't really know. There's a more specific question for me: do we imagine these info fields as being a read/write bag for extra data a client wishes to associate with a given object? The recent discussion on #142 suggests we do:

@richarddurbin: "If I want to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map, what would have to change?"

@cassiedoll: "Let's additionally pretend that I had some new info tag I was messing around with. I should be able to run some analysis on my Variants, come up with my snazzy info tag, and store it back into the API."

Is this really a good model though? Wouldn't it be better to have a read/write 'table' where a user can add annotations (not to be confused with the functional annotations of #126!) against a given object ID, rather than update the object itself? If we take this approach, then perhaps there isn't as much pressure to make the info map so general. One possible approach is:

  1. Upgrade the important VCF INFO fields above to first class status;
  2. Push any remaining INFO tags in our input VCF into the info map as string/array<string> (I'm neutral on this) values;
  3. Provide a good API mechanism to read/write annotations on these Variant objects (not through inserting key/value pairs into the info map).

This might also help from a reproducability perspective, since we can add as many (time stamped) annotations into this table as we like without interfering with the primary data.

@pgrosu
Copy link
Contributor

pgrosu commented Sep 11, 2014

@ekg, Sorry to reply so late. I got caught up in another discussion. So let's approach this through, encapsulation which is similar to what we have now. Inheritance would be neater if it was possible in Avro/IDL, but the code would become more cumbersome. So below is a general class structure how I generally see this and where you have granularity in updating the INFO fields:

enum GAInfoField { AA, AC, AF, NS, SB, ... }

enum GAInfoDataType { INT, BOOLEAN, FLOAT, STRING }

class GACallInfo () {

  GAInfoType InfoField;
  GAInfoDataType InfoDataType;
  String InfoValue; 
  ...
}

class GAVariantSet ( ... ) {

  ...

  GACallInfo ci = new GACallInfo();
  map<GACallSet> GACallSetMap = { 'GACallSet.id' : new GACallSet (ci), ... }
  map<GAVariant> VariantMap   = { 'GAVariant.id' : new GAVariant (ci), ... }

}

class GACallSet (GACallInfo ci, ...)  {
  map<GACall> GACallMap = { 'GACall.id' : new GACall (ci), ... }
}

class GAVariant (GACallInfo ci, ...) { 
  GACallInfo ci = update(GACallInfo ci);  // Here you update DP
  map<GACall> GACallMap = { 'GACall.id' : new GACall (ci), ... }
}

class GACall (GACallInfo ci, ...)  { 
  GACallInfo ci = update(GACallInfo ci); // Here you would update AC
}

Based on this, it would require minimal changes to the schema, but some changes would be necessary.

Again, I should reiterate that we are still looking at the API through the lens of simple, clean designs but I'm not so sure they will scale. I agree that it's good to start somewhere, but at some point we should start thinking in petascale, parallel data-structures which opens up other analysis options we will require to perform later on. Just like @richarddurbin suggested in #142 by wanting "to calculate a new quality control metric for each call in a call set, and add it as a new key:value attribute in the info map" we can have such flexibility. The idea of approaches like Google Pregel in conjunction with Google Mesa would allow the opportunity to update data on the fly, and have the updates be propagated as a new version across all the studies for side-by-side comparison. That would open up a new world for large-scale data sharing, processing and analysis.

@pgrosu
Copy link
Contributor

pgrosu commented Sep 30, 2014

Hi Benedict,

The slides look great and fully agree this is essential. Not sure I can make it to ASHG, but would #143 be part of this discussion?

Thanks,
Paul

@cassiedoll
Copy link
Member Author

@benedictpaten I think those slides belong on #142 instead?

@benedictpaten
Copy link
Contributor

Oops, yes. Hold on.. need coffee or something.

@skeenan
Copy link
Member

skeenan commented Oct 29, 2014

Is this ready to be closed?

@skeenan
Copy link
Member

skeenan commented Apr 8, 2015

The last comment on this was Oct 2014 I'm closing this in 2 days.

@skeenan
Copy link
Member

skeenan commented Apr 10, 2015

Closing due to inactivity. This can be reopened if necessary.

@skeenan skeenan closed this as completed Apr 10, 2015
dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this issue Jul 20, 2016
Filename in convert_to_binary.sh is now correct.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests