Load SV type info field - need for allele uniqueness #1134
I'm having an issue with a CNV allele from 1000 Genomes disappearing in my HBase work. The same allele disappears if one does a VariantContextRDD.toGenotypeRDD.toVariantContextRDD round trip using the existing ADAM code.
The original VCF lines in question are below, two rows starting at the same position, with CN0 appearing in both.
I think the representation in the VCF is somewhat dubious anyhow; in fact, I think esv3647587 may well be redundantly represented in the original VCF, and one copy could be dropped. Right now, however, I'd mainly be happy if I had a way to generate a unique key for both of the alleles above, as I need such a key for HBase. A simple transform from multi-allelic to single-allele rows, where EVERY alt in the original VCF gets its own row, seems the simplest.
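To make the proposed transform concrete, here is a minimal sketch in plain Scala. It assumes a simplified row model: `VcfRow`, `SingleAltRow`, and `SplitAlts` are hypothetical names for illustration, not ADAM or bdg-formats classes, and the sketch ignores per-allele INFO fields (`Number=A`), which a real transform would also need to split.

```scala
// Hypothetical, simplified model of a VCF data row (not an ADAM class).
final case class VcfRow(chrom: String, pos: Long, ref: String, alts: Seq[String])

// One output row per ALT allele.
final case class SingleAltRow(chrom: String, pos: Long, ref: String, alt: String)

object SplitAlts {
  // Emit one single-allele row for every ALT, in the order listed in the VCF row.
  def split(row: VcfRow): Seq[SingleAltRow] =
    row.alts.map(alt => SingleAltRow(row.chrom, row.pos, row.ref, alt))
}
```

For example, `SplitAlts.split(VcfRow("22", 100L, "A", Seq("<CN0>", "<CN2>")))` yields two rows, one for `<CN0>` and one for `<CN2>` (the values here are made up for illustration). Note that two input rows sharing the same position, ref, and alt would still yield identical output rows, which is the collision this issue is about.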
The key currently consists of contig_start_position_ref_alt
What would help is if I had access to the SVTYPE info field, as this is different between the two rows above.
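As a rough sketch of what such a key could look like, assuming SVTYPE were available as a string attribute on each variant: the `SimpleVariant` class and `VariantKeys` object below are hypothetical stand-ins written in plain Scala, not the ADAM or bdg-formats API, and the sample values are illustrative only.

```scala
// Hypothetical, simplified stand-in for a variant record (not the bdg-formats Variant).
final case class SimpleVariant(
  contigName: String,
  start: Long,
  referenceAllele: String,
  alternateAllele: String,
  attributes: Map[String, String])

object VariantKeys {
  // Key built from contig, start position, ref, and alt only.
  def baseKey(v: SimpleVariant): String =
    s"${v.contigName}_${v.start}_${v.referenceAllele}_${v.alternateAllele}"

  // Same key with SVTYPE appended when the attribute is present,
  // so two rows that differ only in SVTYPE get distinct keys.
  def svAwareKey(v: SimpleVariant): String =
    v.attributes.get("SVTYPE").fold(baseKey(v))(t => s"${baseKey(v)}_$t")
}
```

With two variants that share contig, position, ref, and alt but carry `SVTYPE=DEL` and `SVTYPE=CNV` respectively, `baseKey` returns the same string for both, while `svAwareKey` returns two different strings. Variants with no SVTYPE attribute keep the plain key, so existing keys for non-SV rows would be unchanged under this scheme.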
However, it looks like we are not currently populating SV fields like:
as this is null when the above VCF data is loaded.
@heuermh can you remind me, or point me to the discussion, of what the final plan was on the future of structural variation fields like this?
In #1131 the
When the upgrade-to-bdg-formats-0.10.0 branch is merged,
Then we have an issue bigdatagenomics/bdg-formats#104 to revisit whether handling all the structural variant VCF INFO keys merits bringing back bdg-formats records in some form.
@jpdna This use case can be handled with attributes (see below). Can we close this issue?
$ cat svtype.vcf
##fileformat=VCFv4.1
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
22	16050612	rs2186463	C	G	.	.	WGT=3;SVTYPE=DEL
22	16050612	rs146752890	C	G	.	.	WGT=1;SVTYPE=CNV

$ ./bin/adam-shell

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val variants = sc.loadVariants("svtype.vcf")
variants: org.bdgenomics.adam.rdd.variant.VariantRDD = VariantRDD(MapPartitionsRDD ...

scala> variants.rdd.foreach(v => println(Seq(v.getContigName, v.getStart, v.getReferenceAllele, v.getAlternateAllele, v.getAnnotation().getAttributes.get("SVTYPE")).mkString("_")))
22_16050611_C_G_DEL
22_16050611_C_G_CNV