BQSR should support recalibration across multiple ADAM files #58

Closed
tdanford opened this Issue Jan 22, 2014 · 5 comments

Comments

Projects
None yet
2 participants
@tdanford
Contributor

tdanford commented Jan 22, 2014

Per the discussion here: #48 (comment)

It doesn't appear that BQSR could currently be (correctly) run across multiple ADAM files with overlapping read groups (or possibly multiple ADAM files with different read groups either, I'm not sure).

We should probably (a) make that clearer in the documentation to BQSR, and (b) correct it (by building or maintaining a master read-group map that can be used to correctly define the read group covariate across files).

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jan 22, 2014

Member

Timothy,

A question for you here: what exactly is the use case here? It seems to me that if you were going to run BQSR across data that was in multiple files, you would take the union of the files, then run BQSR on that union.

I'm not saying it's not done (as people have provided evidence that it is), but I don't entirely understand what the "correct" multi-file BQSR use model is, and would appreciate some clarification.

Member

fnothaft commented Jan 22, 2014

Timothy,

A question for you here: what exactly is the use case here? It seems to me that if you were going to run BQSR across data that was in multiple files, you would take the union of the files, then run BQSR on that union.

I'm not saying it's not done (as people have provided evidence that it is), but I don't entirely understand what the "correct" multi-file BQSR use model is, and would appreciate some clarification.

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Jan 22, 2014

Contributor

Frank, I'm assuming you've been watching the discussion on this commit: arahuja@f4e30a5

So if you have read groups A & B for a sample, but those read groups come to you in two different BAM files which you convert separately into ADAM (because they arrive at two separate times)... my understanding of the code is that both A & B would be assigned a recordGroupId of 0.

And then BQSR wouldn't distinguish reads from the two read groups as having different covariates too, right?

That's the main problem, I think. Longer term, I suspect that there will be people who will want to run BQSR or some updated version of it across multiple samples -- but again, you get the same problem if they're converted separately using the current code.

This issue is more a reminder, that we should think about how to address this problem in the future. It can be gotten around if you pre-merge the BAMs before conversion, but we should be clear that you need to do that. Otherwise, I assume you won't see an error, just a subtle mis-correction of base quality scores.

Please correct me if I'm wrong, though...

Contributor

tdanford commented Jan 22, 2014

Frank, I'm assuming you've been watching the discussion on this commit: arahuja@f4e30a5

So if you have read groups A & B for a sample, but those read groups come to you in two different BAM files which you convert separately into ADAM (because they arrive at two separate times)... my understanding of the code is that both A & B would be assigned a recordGroupId of 0.

And then BQSR wouldn't distinguish reads from the two read groups as having different covariates too, right?

That's the main problem, I think. Longer term, I suspect that there will be people who will want to run BQSR or some updated version of it across multiple samples -- but again, you get the same problem if they're converted separately using the current code.

This issue is more a reminder, that we should think about how to address this problem in the future. It can be gotten around if you pre-merge the BAMs before conversion, but we should be clear that you need to do that. Otherwise, I assume you won't see an error, just a subtle mis-correction of base quality scores.

Please correct me if I'm wrong, though...

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Jan 22, 2014

Contributor

Actually, I guess the discussion was in the pull request: #48 (comment)

on an older version of the commit.

Contributor

tdanford commented Jan 22, 2014

Actually, I guess the discussion was in the pull request: #48 (comment)

on an older version of the commit.

@tdanford tdanford added the pick me up! label Jun 8, 2014

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Jun 8, 2014

Contributor

@fnothaft Where did we leave this? It's been a few months since we talked about it, I still believe that multi-BAM (and especially multi-Parquet) BQSR is a definite requirement.

Contributor

tdanford commented Jun 8, 2014

@fnothaft Where did we leave this? It's been a few months since we talked about it, I still believe that multi-BAM (and especially multi-Parquet) BQSR is a definite requirement.

@fnothaft fnothaft added the wontfix label Jul 20, 2016

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 20, 2016

Member

Closing as won't fix. The current BQSR codebase does support BQSR over a union of RDDs, which is generally sufficient for this issue, but also, we don't recommend running multi-sample BQSR. It's not that we recommend against it, but rather there's no benefit to running multi-sample BQSR instead of single-sample BQSR due to how the recalibration model is trained.

Member

fnothaft commented Jul 20, 2016

Closing as won't fix. The current BQSR codebase does support BQSR over a union of RDDs, which is generally sufficient for this issue, but also, we don't recommend running multi-sample BQSR. It's not that we recommend against it, but rather there's no benefit to running multi-sample BQSR instead of single-sample BQSR due to how the recalibration model is trained.

@fnothaft fnothaft closed this Jul 20, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment