[ADAM-1670] Add ability to selectively project VCF fields. #1671

Merged
heuermh merged 1 commit into bigdatagenomics:master from fnothaft:issues/1670-vcf-projection on Oct 18, 2017

Member

fnothaft commented Aug 18, 2017

Resolves #1670. Adds a method loadVcfWithProjection to ADAMContext. This method takes a list of INFO/FORMAT field names, which are then applied as filters to the set of header lines used when creating the VariantContextConverter.
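
The header-line filtering described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `HeaderLine`, `InfoLine`, `FormatLine`, `OtherLine`, and `HeaderProjection.project` are hypothetical stand-ins (ADAM works with htsjdk header line types), and passing non-INFO/FORMAT lines through unchanged is an assumption.

```scala
// Hypothetical stand-ins for VCF header lines; not ADAM or htsjdk types.
sealed trait HeaderLine { def id: String }
case class InfoLine(id: String) extends HeaderLine
case class FormatLine(id: String) extends HeaderLine
case class OtherLine(id: String) extends HeaderLine

object HeaderProjection {
  // Keep an INFO or FORMAT line only if its ID was requested.
  // Assumption: every other kind of header line is passed through untouched.
  def project(
    lines: Seq[HeaderLine],
    infoFields: Set[String],
    formatFields: Set[String]): Seq[HeaderLine] =
    lines.filter {
      case InfoLine(id)   => infoFields.contains(id)
      case FormatLine(id) => formatFields.contains(id)
      case _              => true
    }
}
```

Under this sketch, passing two empty sets drops every INFO and FORMAT line, so a converter built from the projected header would have no INFO/FORMAT fields left to convert.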

Member

fnothaft commented Aug 18, 2017

This is not ready to merge; please hold for testing.

coveralls commented Aug 18, 2017

Coverage Status

Coverage increased (+0.08%) to 83.539% when pulling bcc1f25 on fnothaft:issues/1670-vcf-projection into 6c8f8d7 on bigdatagenomics:master.

AmplabJenkins commented Aug 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2319/
Test PASSed.

Member

heuermh commented Aug 22, 2017

Curious how you are performance testing this method, and how we might separate cost on our side vs that incurred in htsjdk, because isn't htsjdk parsing all the fields anyway?

Member

fnothaft commented Sep 12, 2017

@heuermh will benchmark this and report back.

Member

heuermh commented Oct 18, 2017

In single-node mode, loadVcf takes 2.5x the runtime of a streaming implementation, whereas loadVcfWithProjection takes 1.3x.

In clustered mode on YARN with the data on HDFS, loadVcf takes 2.4x the runtime of a streaming implementation, whereas loadVcfWithProjection takes 1.4x.
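
To put the two loaders side by side, the multiples above can be divided to get the speedup of the projecting load over the plain load. This is a small sketch, assuming both multiples in each mode were measured against the same streaming baseline; the object and value names are made up for illustration.

```scala
object ProjectionSpeedup {
  // Runtime multiples relative to the streaming implementation, as reported above.
  // Each pair is (loadVcf, loadVcfWithProjection).
  val singleNode: (Double, Double) = (2.5, 1.3)
  val yarnHdfs: (Double, Double) = (2.4, 1.4)

  // Speedup of the projecting load over the plain load.
  // Assumption: both multiples in a pair share one streaming baseline.
  def speedup(multiples: (Double, Double)): Double =
    multiples._1 / multiples._2

  def main(args: Array[String]): Unit = {
    println(f"single node: ${speedup(singleNode)}%.2fx")
    println(f"YARN + HDFS: ${speedup(yarnHdfs)}%.2fx")
  }
}
```

That works out to roughly 1.9x in single-node mode and 1.7x on the cluster, that is, a runtime reduction of about 48% and 42% respectively.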

An added benefit of loadVcfWithProjection is that it lets the user filter out known errors in the VCF file.

$ dsh-bio filter-vcf --filter -i IN -o OUT

$ adam-shell -i filter-vcf.scala

import org.bdgenomics.adam.rdd.ADAMContext._
val variants = sc.loadVcf("IN").toVariants()
val filtered = variants.transform(rdd => { rdd.filter(_.getFiltersPassed()) })
filtered.toVariantContexts().saveAsVcf("OUT", asSingleFile = true, deferMerging = false, disableFastConcat = false, stringency = htsjdk.samtools.ValidationStringency.SILENT)
System.exit(0)

$ adam-shell -i filter-filtered-vcf.scala

import org.bdgenomics.adam.rdd.ADAMContext._
val variants = sc.loadVcfWithProjection("IN", Set.empty, Set.empty).toVariants()
val filtered = variants.transform(rdd => { rdd.filter(_.getFiltersPassed()) })
filtered.toVariantContexts().saveAsVcf("OUT", asSingleFile = true, deferMerging = false, disableFastConcat = false, stringency = htsjdk.samtools.ValidationStringency.SILENT)
System.exit(0)

heuermh merged commit f06bbe8 into bigdatagenomics:master on Oct 18, 2017

3 checks passed

codacy/pr: Good work! A positive pull request.
coverage/coveralls: Coverage increased (+0.08%) to 83.539%
default: Merged build finished.

Member

heuermh commented Oct 18, 2017

Thank you, @fnothaft

heuermh added this to the 0.23.0 milestone on Dec 7, 2017

heuermh added this to Completed in Release 0.23.0 on Jan 4, 2018
