Load SV type info field - needed for allele uniqueness #1134
A CNV allele from 1000 Genomes is disappearing in my HBase work, and the same allele disappears if one does a VariantContextRDD.toGenotypeRDD.toVariantContextRDD round trip using the existing ADAM code.
The original VCF lines in question are below, two rows starting at the same position, with CN0 appearing in both.
I think the representation in VCF is somewhat dubious anyhow, and in fact esv3647587 may well be redundantly represented in the original VCF, so one copy could be dropped. For now, though, I'd mainly be happy to have a way to generate a unique key for both of the alleles above, since I need such a key for HBase. A simple transform from multi-allelic to single-allele, where EVERY alt in the original VCF gets its own row, seems the simplest.
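As a rough illustration of the multi-allelic to single-allele transform described above, here is a minimal Scala sketch. The `VcfSite` and `AlleleRow` case classes and the `SplitAlleles` object are hypothetical stand-ins invented for this example; they are not ADAM or htsjdk types, and the sketch ignores INFO and genotype fields.

```scala
// Hypothetical, simplified VCF site record (not an ADAM or htsjdk type).
final case class VcfSite(contig: String, pos: Long, ref: String, alts: Seq[String])

// Hypothetical single-allele row: one of these per ALT allele.
final case class AlleleRow(contig: String, pos: Long, ref: String, alt: String)

object SplitAlleles {
  // Emit one row for every ALT allele listed at the site,
  // preserving the order in which the alleles appear.
  def split(site: VcfSite): Seq[AlleleRow] =
    site.alts.map(alt => AlleleRow(site.contig, site.pos, site.ref, alt))
}
```

A real transform would also have to split per-allele INFO values (those declared with Number=A) alongside the alleles, which this sketch leaves out.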
The key currently consists of contig_start_position_ref_alt
What would help is if I had access to the SVTYPE info field, as this is different between the two rows above.
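To make the collision concrete, here is a minimal sketch of the key with and without SVTYPE. `SiteAllele` and `AlleleKey` are hypothetical names used only for illustration (they are not ADAM records), and the INFO fields are modelled as a plain `Map[String, String]`, which is an assumption of this example.

```scala
// Hypothetical, simplified single-allele row; INFO is modelled as a plain map.
final case class SiteAllele(
    contig: String,
    start: Long,
    ref: String,
    alt: String,
    info: Map[String, String])

object AlleleKey {
  // Key as described above: contig_start_ref_alt.
  def basic(a: SiteAllele): String =
    Seq(a.contig, a.start.toString, a.ref, a.alt).mkString("_")

  // Same key with SVTYPE appended when the INFO field is present;
  // rows without SVTYPE keep the basic key unchanged.
  def withSvType(a: SiteAllele): String =
    (Seq(a.contig, a.start.toString, a.ref, a.alt) ++
      a.info.get("SVTYPE").toSeq).mkString("_")
}
```

Two rows that share contig, start, ref, and alt get the same `basic` key, so one overwrites the other in a keyed store; `withSvType` keeps them apart only when their SVTYPE values differ.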
However, it looks like we are not currently populating SV fields like:
as this is null when the above VCF data is loaded.
@heuermh can you remind me, or point me to the discussion, of what the final plan was on the future of structural variation fields like this?
In #1131 the
When the upgrade-to-bdg-formats-0.10.0 branch is merged,
Then we have an issue bigdatagenomics/bdg-formats#104 to revisit whether handling all the structural variant VCF INFO keys merits bringing back bdg-formats records in some form.
@jpdna This use case can be handled with attributes (see below). Can we close this issue?
$ cat svtype.vcf
##fileformat=VCFv4.1
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
#CHROM POS ID REF ALT QUAL FILTER INFO
22 16050612 rs2186463 C G . . WGT=3;SVTYPE=DEL
22 16050612 rs146752890 C G . . WGT=1;SVTYPE=CNV

$ ./bin/adam-shell

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val variants = sc.loadVariants("svtype.vcf")
variants: org.bdgenomics.adam.rdd.variant.VariantRDD = VariantRDD(MapPartitionsRDD ...

scala> variants.rdd.foreach(v => println(Seq(v.getContigName, v.getStart, v.getReferenceAllele, v.getAlternateAllele, v.getAnnotation().getAttributes.get("SVTYPE")).mkString("_")))
22_16050611_C_G_DEL
22_16050611_C_G_CNV