New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF annotation question #1994

Closed
pstawinski opened this Issue Jun 11, 2018 · 5 comments

Comments

Projects
None yet
2 participants
@pstawinski
Copy link

pstawinski commented Jun 11, 2018

Hi there,

I would like to use ADAM to annotate VCF file. I have not found any example how to do so in the docs, it's why I'm writing here. I'd like to read a VCF file, add some annotations (for example new INFO fields with according header lines) and save it back to VCF.
Is it possible?

Thank you in advance,
Piotr

@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jun 11, 2018

Hello @pstawinski!

Adding new header lines is straightforward

import org.bdgenomics.adam.rdd.ADAMContext._
import htsjdk.variant.vcf.VCFHeaderLineType

val vcs = sc.loadVcf("sample.vcf")

val vcsWithHeaderLine = vcs.addScalarInfoHeaderLine(
  "key", "Custom key INFO field", VCFHeaderLineType.String)

vcsWithHeaderLine.saveAsVcf("sample-info-key.vcf")

See
http://static.javadoc.io/org.bdgenomics.adam/adam-core-spark2_2.11/0.24.0/index.html#org.bdgenomics.adam.rdd.variant.VariantContextRDD

The code for adding new attributes looks different if you're using RDDs or Datasets, which are you working with?

@pstawinski

This comment has been minimized.

Copy link

pstawinski commented Jun 11, 2018

Support of ADAM is really amazing! Thank you heuermh!

I prefer to work with Datasets.

@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jun 11, 2018

ok, with Datasets it is a little bit verbose, because currently the classes generated to the org.bdgenomics.adam.sql package are Scala Products but not case class (see #1972)

import htsjdk.variant.vcf.VCFHeaderLineType

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.sql.Variant
import org.bdgenomics.adam.sql.VariantAnnotation
import org.bdgenomics.adam.sql.VariantContext

val vcs = sc.loadVcf("sample.vcf")

val vcsWithHeaderLine = vcs.addScalarInfoHeaderLine(
  "key", "Custom key INFO field", VCFHeaderLineType.String)

def addInfo(vc: VariantContext): VariantContext = {
  vc.variant.annotation.fold(vc)(annotation => {
    // add new attribute
    val attributes = annotation.attributes + ("key" -> "value")

    // copy to new annotation
    val annotationCopy = new VariantAnnotation(
      alleleCount = annotation.alleleCount,
      alleleFrequency = annotation.alleleFrequency,
      ancestralAllele = annotation.ancestralAllele,
      attributes = attributes,
      cigar = annotation.cigar,
      dbSnp = annotation.dbSnp,
      forwardReadDepth = annotation.forwardReadDepth,
      hapMap2 = annotation.hapMap2,
      hapMap3 = annotation.hapMap3,
      readDepth = annotation.readDepth,
      referenceForwardReadDepth = annotation.referenceForwardReadDepth,
      referenceReadDepth = annotation.referenceReadDepth,
      referenceReverseReadDepth = annotation.referenceReverseReadDepth,
      reverseReadDepth = annotation.reverseReadDepth,
      somatic = annotation.somatic,
      thousandGenomes = annotation.thousandGenomes,
      transcriptEffects = annotation.transcriptEffects,
      validated = annotation.validated
    )

    // copy to new variant
    val variantCopy = new Variant(
      alternateAllele = vc.variant.alternateAllele,
      annotation = Some(annotationCopy),
      contigName = vc.variant.contigName,
      end = vc.variant.end,
      filtersApplied = vc.variant.filtersApplied,
      filtersFailed = vc.variant.filtersFailed,
      filtersPassed = vc.variant.filtersPassed,
      names = vc.variant.names,
      quality = vc.variant.quality,
      referenceAllele = vc.variant.referenceAllele,
      splitFromMultiAllelic = vc.variant.splitFromMultiAllelic,
      start = vc.variant.start
    )

    // copy to new variant context
    vc.copy(variant = variantCopy)
  })
}

val vcsWithAttribute = vcsWithHeaderLine.transformDataset(dataset => dataset.map(addInfo))

vcsWithAttribute.saveAsVcf("small-info-key.vcf")
@pstawinski

This comment has been minimized.

Copy link

pstawinski commented Jun 12, 2018

Thanks a million, Michael!

@pstawinski pstawinski closed this Jun 12, 2018

@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jun 12, 2018

FYI, now that #1972 has been merged, on git head this is a lot shorter

import htsjdk.variant.vcf.VCFHeaderLineType

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.sql.Variant
import org.bdgenomics.adam.sql.VariantAnnotation
import org.bdgenomics.adam.sql.VariantContext

val vcs = sc.loadVcf("adam-core/src/test/resources/small.vcf")

val vcsWithHeaderLine = vcs.addScalarInfoHeaderLine(
  "key", "Custom key INFO field", VCFHeaderLineType.String)

def addInfo(vc: VariantContext): VariantContext = {
  vc.variant.annotation.fold(vc)(annotation => {
    val attributes = annotation.attributes + ("key" -> "value")
    val annotationCopy = annotation.copy(attributes = attributes)
    val variantCopy = vc.variant.copy(annotation = Some(annotationCopy))
    vc.copy(variant = variantCopy)
  })
}

val vcsWithAttribute = vcsWithHeaderLine.transformDataset(dataset => dataset.map(addInfo))

vcsWithAttribute.saveAsVcf("small-info-key.vcf")

@heuermh heuermh added this to the 0.24.1 milestone Aug 28, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment