Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load SV type info field - need for allele uniquness #1134

jpdna opened this issue Aug 26, 2016 · 3 comments

Load SV type info field - need for allele uniquness #1134

jpdna opened this issue Aug 26, 2016 · 3 comments


Copy link

@jpdna jpdna commented Aug 26, 2016

I'm having an issue of a CNV allele from 1000 genomes disappearing in my hbase work - and the same allele disappears if ones does a VariantContextRDD.toGenotypeRDD.toVariantContextRDD using the existing ADAM code.

The original VCF lines in question are below, two rows starting at the same position, with CN0 appearing in both.

22  31876537    esv3647587  A   <CN0>   100 PASS    AC=2;AF=0.000399361;AN=5008;CIEND=0,500;CIPOS=-500,0;CS=DEL_union;END=31882098;NS=2504;SVTYPE=DEL;DP=17915;EAS_AF=0;AMR_AF=0;AFR_AF=0.0015;EUR_AF=0;SAS_AF=0;VT=SV  GT  0|0

22  31876537    esv3647587;esv3647588   A   <CN0>,<CN2> 100 PASS    AC=2,2;AF=0.000399361,0.000399361;AN=5008;CS=DUP_gs;END=31882098;NS=2504;SVTYPE=CNV;DP=17915;EAS_AF=0,0;AMR_AF=0,0;AFR_AF=0.0015,0;EUR_AF=0,0.001;SAS_AF=0,0.001;VT=SV  GT  0|0

I think the representation in VCF is somewhat dubious anyhow - and in fact the esv3647587 I think may well be redundantly represented in the original VCF and one copy could be dropped - however right now mainly I'd be happy if I had a way to generate a unique key for both the alleles above, as I need such a key for HBase - and a simple transform from mult-allele to single allele where EVERY alt in the original VCF gets its own row seems the simplest.

The key currently consists of contig_start_position_ref_alt
I've considered add "end_position" as well but in this case the end is the same, so doesn't help.

What would help is if I had access to the SVTYPE info field, as this is different between the two rows above.

However - it looks like we are not currently populating sv fields like:

as this is null when the above VCF data is loaded.

@heuermh can you remind me, or point me to the discussion, of what the final plan was on the future of structural variation fields like this?

Copy link

@heuermh heuermh commented Aug 26, 2016

In #1131 the StructuralVariant and StructuralVariantType bdg-formats records are removed, which doesn't matter much, since as you mention they and the svAllele field you linked to were never populated.

When the upgrade-to-bdg-formats-0.10.0 branch is merged, SVTYPE will be available in the attributes field of VariantAnnotation.

Then we have an issue bigdatagenomics/bdg-formats#104 to revisit whether handling all the structural variant VCF INFO keys merits bringing back bdg-formats records in some form.

@heuermh heuermh modified the milestone: 0.20.0 Sep 8, 2016
@heuermh heuermh modified the milestones: 0.20.0, 0.22.0 Oct 13, 2016
@heuermh heuermh added this to Triage in Release 0.23.0 Mar 8, 2017
@fnothaft fnothaft modified the milestones: 0.24.0, 0.23.0 May 12, 2017
@fnothaft fnothaft removed this from Triage in Release 0.23.0 May 12, 2017
Copy link

@heuermh heuermh commented Jun 1, 2017

@jpdna This use case can be handled with attributes (see below). Can we close this issue?

$ cat svtype.vcf
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
22	16050612	rs2186463	C	G	.	.	WGT=3;SVTYPE=DEL
22	16050612	rs146752890	C	G	.	.	WGT=1;SVTYPE=CNV

$ ./bin/adam-shell

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val variants = sc.loadVariants("svtype.vcf")
variants: org.bdgenomics.adam.rdd.variant.VariantRDD = VariantRDD(MapPartitionsRDD[2] ...

scala> variants.rdd.foreach(v => println(Seq(v.getContigName, v.getStart, v.getReferenceAllele, v.getAlternateAllele, v.getAnnotation().getAttributes.get("SVTYPE")).mkString("_"))) 
Copy link

@heuermh heuermh commented Nov 7, 2017

Closing as not an issue.

@heuermh heuermh closed this Nov 7, 2017
@heuermh heuermh modified the milestones: 0.24.0, 0.23.0 Dec 7, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.