New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BQSR updates #213

Merged
merged 7 commits into from Apr 22, 2014

Conversation

Projects
None yet
5 participants
@jey
Contributor

jey commented Apr 9, 2014

This adds the CycleCovariate and some miscellaneous fixes to BQSR

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Apr 9, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/273/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/274/

AmplabJenkins commented Apr 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/274/

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 9, 2014

Member

Is the one 16k line file necessary? Like, is it used for testing, or something of the sort?

Member

fnothaft commented Apr 9, 2014

Is the one 16k line file necessary? Like, is it used for testing, or something of the sort?

@jey

This comment has been minimized.

Show comment
Hide comment
@jey

jey Apr 9, 2014

Contributor

Yeah, it's used for testing, and contains the reference values. With the addition of the CycleCovariate the number of combinations in the table went from ~500 to ~16000.

I'm open to other suggestions for testing this!

Contributor

jey commented Apr 9, 2014

Yeah, it's used for testing, and contains the reference values. With the addition of the CycleCovariate the number of combinations in the table went from ~500 to ~16000.

I'm open to other suggestions for testing this!

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 9, 2014

Member

Ah! OK. I was just curious.

Member

fnothaft commented Apr 9, 2014

Ah! OK. I was just curious.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Apr 9, 2014

Member

@jey Can you run mvn org.scalariform:scalariform-maven-plugin:format on this code and push the changes to this pull request?

Member

massie commented Apr 9, 2014

@jey Can you run mvn org.scalariform:scalariform-maven-plugin:format on this code and push the changes to this pull request?

@jey

This comment has been minimized.

Show comment
Hide comment
@jey

jey Apr 9, 2014

Contributor

Hi Matt, I ran mvn process-sources while I was rebasing this onto master. I just tried it again and nothing changed, though running mvn org.scalariform:scalariform-maven-plugin:format seems to reformat a lot more than just my changes using the Scalariform default settings.

Is process-sources or org.scalariform:scalariform-maven-plugin:format the recommended way? Here's the output I get when I run the second one: jey@1c2b59b

Contributor

jey commented Apr 9, 2014

Hi Matt, I ran mvn process-sources while I was rebasing this onto master. I just tried it again and nothing changed, though running mvn org.scalariform:scalariform-maven-plugin:format seems to reformat a lot more than just my changes using the Scalariform default settings.

Is process-sources or org.scalariform:scalariform-maven-plugin:format the recommended way? Here's the output I get when I run the second one: jey@1c2b59b

@jey

This comment has been minimized.

Show comment
Hide comment
@jey

jey Apr 9, 2014

Contributor

I'm personally a fan of the default settings, without the alignment, since that's easier to write by hand.

Contributor

jey commented Apr 9, 2014

I'm personally a fan of the default settings, without the alignment, since that's easier to write by hand.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Apr 9, 2014

Member

org.scalariform:scalariform-maven-plugin:format is the recommended way.

Can you rebase this onto the latest master? Thanks, Jey.

Member

massie commented Apr 9, 2014

org.scalariform:scalariform-maven-plugin:format is the recommended way.

Can you rebase this onto the latest master? Thanks, Jey.

@jey

This comment has been minimized.

Show comment
Hide comment
@jey

jey Apr 9, 2014

Contributor

Done

Contributor

jey commented Apr 9, 2014

Done

@nealsid

This comment has been minimized.

Show comment
Hide comment
@nealsid

nealsid Apr 9, 2014

Contributor

Yup, the way @massie specified is the way to go
(Actually, running "process-sources" won't work anymore because we didn't
want reformatting to be part of the normal build process)

On Wed, Apr 9, 2014 at 2:18 PM, Jey Kottalam notifications@github.comwrote:

Done


Reply to this email directly or view it on GitHubhttps://github.com//pull/213#issuecomment-39998237
.

Contributor

nealsid commented Apr 9, 2014

Yup, the way @massie specified is the way to go
(Actually, running "process-sources" won't work anymore because we didn't
want reformatting to be part of the normal build process)

On Wed, Apr 9, 2014 at 2:18 PM, Jey Kottalam notifications@github.comwrote:

Done


Reply to this email directly or view it on GitHubhttps://github.com//pull/213#issuecomment-39998237
.

@jey

This comment has been minimized.

Show comment
Hide comment
@jey

jey Apr 9, 2014

Contributor

Oh OK, thanks!

Contributor

jey commented Apr 9, 2014

Oh OK, thanks!

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Apr 9, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/277/

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Apr 9, 2014

Member

Jenkins, test this please.

Member

massie commented Apr 9, 2014

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/278/

AmplabJenkins commented Apr 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/278/

@@ -181,7 +182,7 @@ class ADAMRecordRDDFunctions(rdd: RDD[ADAMRecord]) extends ADAMSequenceDictionar
MarkDuplicates(rdd)
}
def adamBQSR(knownSnps: SnpTable): RDD[ADAMRecord] = {
def adamBQSR(knownSnps: Broadcast[SnpTable]): RDD[ADAMRecord] = {

This comment has been minimized.

@fnothaft

fnothaft Apr 9, 2014

Member

IMO, it would be preferable to have:

def adamBQSR(knownSnps: SnpTable): RDD[ADAMRecord] = {
     BaseQualityRecalibration(rdd, rdd.context.broadcast(knownSnps))

I think this is a bit cleaner, would be glad to hear comments.

@fnothaft

fnothaft Apr 9, 2014

Member

IMO, it would be preferable to have:

def adamBQSR(knownSnps: SnpTable): RDD[ADAMRecord] = {
     BaseQualityRecalibration(rdd, rdd.context.broadcast(knownSnps))

I think this is a bit cleaner, would be glad to hear comments.

This comment has been minimized.

@jey

jey Apr 11, 2014

Contributor

I did it this way so that the variable would only be broadcast once. In practical terms, this is moot right now since we don't reuse it except in BQSR, but this would make it clear to avoid re-broadcasting the SnpTable in the future. I'm open to either approach.

@jey

jey Apr 11, 2014

Contributor

I did it this way so that the variable would only be broadcast once. In practical terms, this is moot right now since we don't reuse it except in BQSR, but this would make it clear to avoid re-broadcasting the SnpTable in the future. I'm open to either approach.

residue.isRegularBase &&
!residue.isInsertion &&
!knownSnps.value.isMasked(residue)
private def dumpVisits(filename: String) = {

This comment has been minimized.

@fnothaft

fnothaft Apr 9, 2014

Member

What is a visit? Can you add docs?

@fnothaft

fnothaft Apr 9, 2014

Member

What is a visit? Can you add docs?

This comment has been minimized.

@massie

massie Apr 21, 2014

Member

I pretty sure Jey's talking about the visitor pattern here.

@massie

massie Apr 21, 2014

Member

I pretty sure Jey's talking about the visitor pattern here.

@@ -30,11 +30,6 @@ import scala.collection.mutable
class Observation(val total: Long, val mismatches: Long) extends Serializable {
require(mismatches >= 0 && mismatches <= total)
/**
* Whether to emulate GATK's calculations.

This comment has been minimized.

@fnothaft

fnothaft Apr 9, 2014

Member

What changes if you do/don't emulate GATK? Can you add docs?

@fnothaft

fnothaft Apr 9, 2014

Member

What changes if you do/don't emulate GATK? Can you add docs?

This comment has been minimized.

@massie

massie Apr 21, 2014

Member

This was mainly used for debugging more than anything. This change makes us as concordant with GATK as possible without inheriting their bugs.

@massie

massie Apr 21, 2014

Member

This was mainly used for debugging more than anything. This change makes us as concordant with GATK as possible without inheriting their bugs.

This comment has been minimized.

@jey

jey Apr 22, 2014

Contributor

Yup, exactly. The old code was trying to use the same logic as GATK 1.6, but the Bayesian approach is both more principled/correct and now gets >99.99% concordance with GATK.

I'm not sure why we get lower concordance when using GATK 1.6's logic, but I'm guessing it's because of two effects: a) their complicated logic basically reduces to the same as ours, and b) I didn't reproduce some aspect of their formula accurately.

@jey

jey Apr 22, 2014

Contributor

Yup, exactly. The old code was trying to use the same logic as GATK 1.6, but the Bayesian approach is both more principled/correct and now gets >99.99% concordance with GATK.

I'm not sure why we get lower concordance when using GATK 1.6's logic, but I'm guessing it's because of two effects: a) their complicated logic basically reduces to the same as ours, and b) I didn't reproduce some aspect of their formula accurately.

@@ -136,6 +139,12 @@ class DecadentRead(val record: RichADAMRecord) extends Logging {
def isDuplicate: Boolean = record.getDuplicateRead
def isPaired: Boolean = record.getReadPaired

This comment has been minimized.

@fnothaft

fnothaft Apr 9, 2014

Member

Why do we keep these values around, since they can be accessed directly from the read?

@fnothaft

fnothaft Apr 9, 2014

Member

Why do we keep these values around, since they can be accessed directly from the read?

This comment has been minimized.

@jey

jey Apr 22, 2014

Contributor

The intention here is to provide a unified clean interface to the underlying data, as opposed to having to know when to use a DecadentRead method and when to drop down to the ADAMRecord.

@jey

jey Apr 22, 2014

Contributor

The intention here is to provide a unified clean interface to the underlying data, as opposed to having to know when to use a DecadentRead method and when to drop down to the ADAMRecord.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 9, 2014

Member

Looks good; dropped comments inline.

Member

fnothaft commented Apr 9, 2014

Looks good; dropped comments inline.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 18, 2014

Member

Is this waiting on anything (except for a squash) before it is good to merge?

Member

fnothaft commented Apr 18, 2014

Is this waiting on anything (except for a squash) before it is good to merge?

@@ -226,24 +226,24 @@
</javacArgs>

This comment has been minimized.

@massie

massie Apr 21, 2014

Member

Can you remove this formatting-only change?

@massie

massie Apr 21, 2014

Member

Can you remove this formatting-only change?

This comment has been minimized.

@jey

jey Apr 22, 2014

Contributor

(As we discussed in person,) this is replacing tabs in the pom.xml with spaces for the sake of consistency

@jey

jey Apr 22, 2014

Contributor

(As we discussed in person,) this is replacing tabs in the pom.xml with spaces for the sake of consistency

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Apr 21, 2014

Member

This looks ready to go to me. Can you remove the pom.xml file form the pull request and I'll merge it?

Thanks, Jey!

Member

massie commented Apr 21, 2014

This looks ready to go to me. Can you remove the pom.xml file form the pull request and I'll merge it?

Thanks, Jey!

@jey

This comment has been minimized.

Show comment
Hide comment
@jey

jey Apr 22, 2014

Contributor

Sorry for letting this languish so long! I've addressed the remaining issues and this is ready to merge.

Contributor

jey commented Apr 22, 2014

Sorry for letting this languish so long! I've addressed the remaining issues and this is ready to merge.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 22, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/287/

AmplabJenkins commented Apr 22, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/287/

massie added a commit that referenced this pull request Apr 22, 2014

@massie massie merged commit a13c758 into bigdatagenomics:master Apr 22, 2014

1 check passed

default Merged build finished.
Details
@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Apr 22, 2014

Member

Thanks, Jey!

Member

massie commented Apr 22, 2014

Thanks, Jey!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment