[ADAM-1670] Add ability to selectively project VCF fields. #1671

Merged
merged 1 commit into from Oct 18, 2017

Conversation

@fnothaft
Member

@fnothaft fnothaft commented Aug 18, 2017

Resolves #1670. Adds a method loadVcfWithProjection to ADAMContext. This method takes a list of INFO/FORMAT field names, which are then applied as filters to the set of header lines used when creating the VariantContextConverter.
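The header-line filtering described above can be illustrated with a minimal, self-contained sketch. It does not use the htsjdk or ADAM classes: `HeaderLine`, `InfoLine`, `FormatLine`, `OtherLine`, and `projectHeaderLines` are hypothetical names introduced only for illustration, and passing non-INFO/FORMAT lines through unchanged is an assumption rather than something this PR description states.

```scala
// Hypothetical stand-ins for VCF header lines; the real converter works with
// htsjdk header line classes, which are not modelled here.
sealed trait HeaderLine { def id: String }
final case class InfoLine(id: String) extends HeaderLine
final case class FormatLine(id: String) extends HeaderLine
final case class OtherLine(id: String) extends HeaderLine

// Keep an INFO or FORMAT line only if its ID was requested.
// Assumption: every other kind of header line is kept as-is.
def projectHeaderLines(
    lines: Seq[HeaderLine],
    infoFields: Set[String],
    formatFields: Set[String]): Seq[HeaderLine] =
  lines.filter {
    case InfoLine(id)   => infoFields.contains(id)
    case FormatLine(id) => formatFields.contains(id)
    case _              => true
  }

val header = Seq(
  InfoLine("AC"), InfoLine("AF"),
  FormatLine("GT"), FormatLine("DP"),
  OtherLine("PASS"))

// Request one INFO field and no FORMAT fields.
val projected = projectHeaderLines(header, Set("AF"), Set.empty)
```

Under this sketch, passing two empty sets drops every INFO and FORMAT header line, so a converter built from the filtered header would have no per-field work left to do for those attributes.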

@fnothaft
Member Author

@fnothaft fnothaft commented Aug 18, 2017

This is not ready to merge; please hold for testing.

@coveralls

@coveralls coveralls commented Aug 18, 2017

Coverage increased (+0.08%) to 83.539% when pulling bcc1f25 on fnothaft:issues/1670-vcf-projection into 6c8f8d7 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Aug 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2319/
Test PASSed.

@heuermh
Member

@heuermh heuermh commented Aug 22, 2017

Curious how you are performance testing this method, and how we might separate cost on our side vs that incurred in htsjdk, because isn't htsjdk parsing all the fields anyway?

@fnothaft
Member Author

@fnothaft fnothaft commented Sep 12, 2017

@heuermh will benchmark this and report back.

@heuermh
Member

@heuermh heuermh commented Oct 18, 2017

In single-node mode, loadVcf takes 2.5x the runtime of a streaming implementation, whereas loadVcfWithProjection takes 1.3x.

In clustered mode on YARN with the data on HDFS, loadVcf takes 2.4x the runtime of a streaming implementation, whereas loadVcfWithProjection takes 1.4x.
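As a quick reading of the ratios above, the implied speedup of the projected load over plain loadVcf can be computed directly. This is only arithmetic on the reported figures, not a new measurement; `speedup` is a hypothetical helper name. It works out to roughly 1.9x on a single node and 1.7x on the cluster.

```scala
// Runtimes relative to the streaming implementation, as reported above.
// speedup is a hypothetical helper: how much faster the projected load is
// than plain loadVcf, given both ratios against the same baseline.
def speedup(loadVcfRatio: Double, projectedRatio: Double): Double =
  loadVcfRatio / projectedRatio

val singleNodeSpeedup = speedup(2.5, 1.3) // single-node figures
val clusterSpeedup = speedup(2.4, 1.4)    // YARN + HDFS figures

println(f"single node: $singleNodeSpeedup%.2fx, cluster: $clusterSpeedup%.2fx")
```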

loadVcfWithProjection has the added benefit of letting the user filter out known errors in the VCF file.

$ dsh-bio filter-vcf --filter -i IN -o OUT

$ adam-shell -i filter-vcf.scala

import org.bdgenomics.adam.rdd.ADAMContext._
val variants = sc.loadVcf("IN").toVariants()
val filtered = variants.transform(rdd => { rdd.filter(_.getFiltersPassed()) })
filtered.toVariantContexts().saveAsVcf("OUT", asSingleFile = true, deferMerging = false, disableFastConcat = false, stringency = htsjdk.samtools.ValidationStringency.SILENT)
System.exit(0)

$ adam-shell -i filter-filtered-vcf.scala

import org.bdgenomics.adam.rdd.ADAMContext._
val variants = sc.loadVcfWithProjection("IN", Set.empty, Set.empty).toVariants()
val filtered = variants.transform(rdd => { rdd.filter(_.getFiltersPassed()) })
filtered.toVariantContexts().saveAsVcf("OUT", asSingleFile = true, deferMerging = false, disableFastConcat = false, stringency = htsjdk.samtools.ValidationStringency.SILENT)
System.exit(0)
@heuermh heuermh merged commit f06bbe8 into bigdatagenomics:master Oct 18, 2017
3 checks passed
codacy/pr Good work! A positive pull request.
coverage/coveralls Coverage increased (+0.08%) to 83.539%
default Merged build finished.
@heuermh
Member

@heuermh heuermh commented Oct 18, 2017

Thank you, @fnothaft

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018