Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reference broadcasting to BQSR #928

Merged
merged 1 commit into from
Oct 7, 2015
Merged

Add reference broadcasting to BQSR #928

merged 1 commit into from
Oct 7, 2015

Conversation

laserson
Copy link
Contributor

please review any of @droazen @tomwhite @davidadamsphd

@laserson
Copy link
Contributor Author

(Also note this adds .2bit chr20/21 reference to LFS which is about 27 MB)

@droazen droazen self-assigned this Sep 25, 2015
@droazen
Copy link
Collaborator

droazen commented Sep 25, 2015

Will do a review pass first thing Monday -- @tomwhite and/or @davidadamsphd should also have a look.

public class BroadcastJoinReadsWithRefBases {
public static JavaPairRDD<GATKRead, ReferenceBases> addBases(final ReferenceDataflowSource referenceDataflowSource,
final JavaRDD<GATKRead> reads) {
JavaSparkContext ctx = new JavaSparkContext(reads.context());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd that reads.context() doesn't return a JavaSparkContext. This is fine though since it's reusing the underlying context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Strange API choice.

@tomwhite
Copy link
Contributor

I added a few fairly minor comments. Curious to see the performance implications.


}

private static void assertSameReads(final JavaRDD<GATKRead> reads,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like to keep this function around to verify we're not losing reads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's no longer being used. If anything, this should be moved into a test somewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving to test sounds good to me.

@davidadamsphd
Copy link
Contributor

@laserson, any chance you can add support for fasta files in a follow up? (I remember @tomwhite did some work on this, but I don't recall the current state).

@@ -35,7 +34,13 @@
public ReferenceDataflowSource(final PipelineOptions pipelineOptions, final String referenceURL,
final SerializableFunction<GATKRead, SimpleInterval> referenceWindowFunction) {
Utils.nonNull(referenceWindowFunction);
if (isFasta(referenceURL)) {
if (referenceURL.endsWith(".2bit")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declare this extension as a named constant at the top of the class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See note above about similar issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -14,6 +14,7 @@
import org.broadinstitute.hellbender.engine.dataflow.datasources.ReferenceDataflowSource;
import org.broadinstitute.hellbender.utils.SimpleInterval;
import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.spark.JoinStrategy;
import org.broadinstitute.hellbender.utils.test.BaseTest;
import org.broadinstitute.hellbender.utils.test.FakeReferenceSource;
import org.broadinstitute.hellbender.utils.variant.Variant;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new unit test to AddContextDataToReadSparkUnitTest that tests the BROADCAST case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the DataProvider simply add a case for BROADCAST strategy as well. It will probably be inefficient, but for small data sets, all the ReferenceSources should work in a broadcast strategy as well I think.

@droazen
Copy link
Collaborator

droazen commented Sep 28, 2015

Have a couple of test failures still -- do we understand what's causing them?

JoinReadsWithVariantsSparkUnitTest. pairReadsAndVariantsTest[0]([org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@dc61e602, org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@784984f1, org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@4985b350, org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@497fa2e4, org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@4985b429], [MinimalVariant -- interval(1:170-180), snp(true), indel(false), MinimalVariant -- interval(1:210-220), snp(false), indel(true), MinimalVariant -- interval(1:10000-10000), snp(true), indel(false), MinimalVariant -- interval(1:29998-30002), snp(false), indel(true), MinimalVariant -- interval(2:10000-10000), snp(false), indel(true)], [KV(org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@784984f1, [MinimalVariant -- interval(1:210-220), snp(false), indel(true), MinimalVariant -- interval(1:170-180), snp(true), indel(false)]), KV(org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@4985b350, [MinimalVariant -- interval(1:10000-10000), snp(true), indel(false)]), KV(org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@497fa2e4, [MinimalVariant -- interval(1:29998-30002), snp(false), indel(true)]), KV(org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@4985b429, [MinimalVariant -- interval(2:10000-10000), snp(false), indel(true)]), KV(org.broadinstitute.hellbender.utils.read.GoogleGenomicsReadToGATKReadAdapter@dc61e602, [null])])
JoinReadsWithVariantsSparkUnitTest. pairReadsAndVariantsTest[1]([1 50b aligned read., 2 100b aligned read., 3 10b aligned read., 4 10b aligned read., 5 10b aligned read.], [MinimalVariant -- interval(1:170-180), snp(true), indel(false), MinimalVariant -- interval(1:210-220), snp(false), indel(true), MinimalVariant -- interval(1:10000-10000), snp(true), indel(false), MinimalVariant -- interval(1:29998-30002), snp(false), indel(true), MinimalVariant -- interval(2:10000-10000), snp(false), indel(true)], [KV(2 100b aligned read., [MinimalVariant -- interval(1:210-220), snp(false), indel(true), MinimalVariant -- interval(1:170-180), snp(true), indel(false)]), KV(3 10b aligned read., [MinimalVariant -- interval(1:10000-10000), snp(true), indel(false)]), KV(4 10b aligned read., [MinimalVariant -- interval(1:29998-30002), snp(false), indel(true)]), KV(5 10b aligned read., [MinimalVariant -- interval(2:10000-10000), snp(false), indel(true)]), KV(1 50b aligned read., [null])])
BaseRecalibratorSparkFnUnitTest. testApply[0](/home/travis/build/broadinstitute/hellbender/src/test/resources/human_g1k_v37.chr17_1Mb.fasta, src/test/resources/org/broadinstitute/hellbender/tools/BQSR/NA12878.chr17_69k_70k.dictFix.bam, src/test/resources/org/broadinstitute/hellbender/tools/BQSR/dbsnp_132.b37.excluding_sites_after_129.chr17_69k_70k.vcf, src/test/resources/org/broadinstitute/hellbender/tools/BQSR/expected.NA12878.chr17_69k_70k.txt)

@droazen
Copy link
Collaborator

droazen commented Sep 28, 2015

Review complete -- back to @laserson

@droazen droazen assigned laserson and unassigned droazen Sep 28, 2015
@akiezun akiezun modified the milestone: alpha Oct 3, 2015
@laserson
Copy link
Contributor Author

laserson commented Oct 6, 2015

Ok, @droazen, I think just a few comments for you to address, and we're almost finished.

@droazen
Copy link
Collaborator

droazen commented Oct 6, 2015

@laserson Ok, responded to the latest round of comments. Ping me when this is rebased/finalized and passing tests, and I'll take a last look and merge.

@laserson
Copy link
Contributor Author

laserson commented Oct 6, 2015

Thanks! One issue is that the tests here will fail until the newest ADAM is released later this week :-/ I don't mind waiting if you don't. Otherwise, I could disable the tests for now.

@droazen
Copy link
Collaborator

droazen commented Oct 6, 2015

I vote to disable the test(s) in question in the interests of getting this merged (@akiezun feel free to chime in if you disagree) -- let's open a ticket to remind us to turn them back on when we rev ADAM, though.

@akiezun
Copy link
Contributor

akiezun commented Oct 6, 2015

fine, disable and enter a ticket to reenable.

Adds a new file to LFS: human_g1k_v37.20.21.2bit. It was by applying the faToTwoBit binary from the "kent" UCSC utilities package on the similarly named fasta reference already in the repo.
@laserson
Copy link
Contributor Author

laserson commented Oct 7, 2015

I anticipate this should pass, and I believe all issues have been addressed.

@droazen
Copy link
Collaborator

droazen commented Oct 7, 2015

👍 merging!

droazen added a commit that referenced this pull request Oct 7, 2015
@droazen droazen merged commit a4cadc3 into master Oct 7, 2015
@droazen droazen deleted the tw_ul_spark_bqsr branch October 7, 2015 15:27
@laserson
Copy link
Contributor Author

laserson commented Oct 7, 2015

hallelujah. sorry that took so long

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants