Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Extending the variant model for DUP, DEL structural variants #772

Closed
wants to merge 4 commits into from

Conversation

mbaudis
Copy link
Member

@mbaudis mbaudis commented Jan 9, 2017

This is a first attempt to address the need for representation of
structural variants. It is chiefly aimed to encode sequence
duplications/deletions, and follows examples from VCFv4.2.
The first commit does not yet implement the services part or documentation.

For some of the discussions, please see issue #752 .

This is a first commit to address the need for representation of
structural variants. It is chiefly aimed to encode sequence
duplications/deletions, and follows examples from VCFv4.2. This commit
does not yet implement the services part or documentation.
@david4096
Copy link
Member

Awesome @mbaudis! Looking forward to working with this in practice. Being able to easily interpret the type of the variant always seemed a useful canonical field to me! What do you think of making the type an enumeration as opposed to a string?

@mbaudis
Copy link
Member Author

mbaudis commented Jan 10, 2017

@david4096 I am neutral regarding enum vs. string; both need some way to document permitted values. This is mostly a schema design decision (we've been going back and forth, with the first version having enums, then those being discouraged ...).

For the code integration team to edit/decide (so go ahead ...).

@mbaudis mbaudis mentioned this pull request Jan 10, 2017
// Examples:
// DUP : duplication of sequence following "start"; not necessarily in situ
// DEL : deletion of sequence following "start"
string variant_type = 13;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use SO terms (eg SO:0000159, deletion) here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, i.e. when not working with a limited (by doc, enumeration) list of terms =>

ontologyTerm variant_type = 13;

... would be the appropriate way to do it. Alas: overhead? @kozbo, @david4096 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 don't write an enum if someone else already has :).

Offering in the proto the variety of variants would be very helpful to other implementors.

duplication    (CURRENT_SVN)
SO Accession: 	SO:1000035 (SOWiki)

structural_variant    (CURRENT_SVN)
SO Accession: 	SO:0001537 (SOWiki)

insertion    (CURRENT_SVN)
SO Accession: 	SO:0000667 (SOWiki)

Copy link
Member Author

@mbaudis mbaudis Jan 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david4096 But not all of the variant types may (well, will) be supported by the schema (yet). So:

  • you need either A) a positive (i.e. defining a set of) or B) a selective (i.e. which of all defined in external reference) variants would constitute supported values
  • case A would naturally be an enumeration, B a documentation case with ontologyTerm (+ service)

Both have conceptual pros and cons, with in A ease/stringency, in B beauty/future speaking for them.

???

Copy link
Member

@david4096 david4096 Jan 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean you want to use a controlled vocabulary when it's available, and an open string otherwise? In that case we can make the field a oneof a string or ontology term. This makes constructing the message a little more baroque, but captures the cases nicely.

Client applications will have to test the field for its type before processing it.

The alternative would be to provide two named fields with some clever naming to represent the type difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david4096 Thought about that, but also the "baroque" :-)

I am open to any solution, as long as it is accepted & forward looking, and gets implemented. The implementation of other variant types is extremely important - SNVs are just a small subset (regarding genome coverage, functional impact; don't get blinded by their count number...).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david4096 I actually had offered this as 2 ways to treat the problem, not as additive solution. You'll still have to document the allowed values in the OT object...

Copy link
Member

@david4096 david4096 Jan 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbaudis Great, sounds good to me! Variant callers want to be able to describe their own variant types and they use their own plain strings to do so. Giving the option of upgrading to a controlled vocabulary is something we might consider doing in other places in the API as well.

It is actually a very useful way to understand how the science moves forward: as new string values for variant type become popular, the sequence ontology has a strong case for reviewing adding it to their vocabulary.

VCF spec has SVTYPE https://samtools.github.io/hts-specs/VCFv4.2.pdf using BND in the type. I believe these need to be interchanged as well. This post is very helpful on the topic of SV calls: https://simpsonlab.github.io/2015/06/15/merging-sv-calls/ .

This is the VT field used in TCGA:

##INFO=<ID=VT,Number=1,Type=String,Description="Variant type, can be SNP, INS , DEL, DNP, TNP, ONP or Consolidated">

https://wiki.nci.nih.gov/display/TCGA/TCGA+Variant+Call+Format+%28VCF%29+Specification

In practice, it is difficult to write code against unknown keys, where one VCF might present the VT field, while another makes use of a different metadata item. One can try to make assumptions about whether the VT info key is being used for variant type, but it is not a part of the spec.

These complexities can of course be left up to implementors. Perhaps VCF 4.4 could have a column for variant type, wherein a list of variant types are curated (SNP, INS...) which map to SO terms, making a clear link between SO and VCF.

samtools/hts-specs#183

// DEL : deletion of sequence following "start"
string variant_type = 13;

// Length of the - if labeled as such in variant_type - somatic variation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be 'structural' rather than 'somatic'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - shows my subconscious bias...

somatic => structural fix
@david4096
Copy link
Member

Closed in favor of #827

@david4096 david4096 closed this Mar 4, 2017
@mbaudis mbaudis deleted the variants-structural-extension branch April 11, 2017 12:17
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants