BQSR should support recalibration across multiple ADAM files #58
Per the discussion here: #48 (comment)
It doesn't appear that BQSR can currently be run correctly across multiple ADAM files with overlapping read groups (and possibly not across multiple ADAM files with different read groups either; I'm not sure).
We should probably (a) make that clearer in the documentation for BQSR, and (b) correct it by building or maintaining a master read-group map that can be used to define the read group covariate consistently across files.
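To make the "master read-group map" idea concrete, here is a minimal sketch in plain Python. It is not ADAM code: `build_master_read_group_map` and the sample file contents are hypothetical names invented for illustration, and it assumes read groups can be identified by name across files.

```python
from typing import Dict, Iterable, List


def build_master_read_group_map(files: Iterable[List[str]]) -> Dict[str, int]:
    """Assign one stable integer ID per read-group name across all files.

    Hypothetical helper, not part of ADAM.
    """
    master: Dict[str, int] = {}
    for read_groups in files:
        for name in read_groups:
            # setdefault keeps the first ID assigned to a name, so a read
            # group that appears in several files maps to a single ID.
            master.setdefault(name, len(master))
    return master


# Read-group names found in two separately converted files (made-up data).
file_1 = ["A"]
file_2 = ["B", "A"]

master_map = build_master_read_group_map([file_1, file_2])
print(master_map)  # {'A': 0, 'B': 1}
```

With a shared map like this, read group A gets the same ID in every file and B gets a different one, which is what the read group covariate needs.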
A question for you: what exactly is the use case here? It seems to me that if you were going to run BQSR across data that was in multiple files, you would take the union of the files, then run BQSR on that union.
I'm not saying it's not done (as people have provided evidence that it is), but I don't entirely understand what the "correct" multi-file BQSR use model is, and would appreciate some clarification.
Frank, I'm assuming you've been watching the discussion on this commit: arahuja@f4e30a5
So if you have read groups A & B for a sample, but those read groups come to you in two different BAM files which you convert separately into ADAM (because they arrive at two separate times)... my understanding of the code is that both A & B would be assigned a recordGroupId of 0.
And then BQSR wouldn't distinguish reads from the two read groups as having different covariates too, right?
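A toy illustration of the collision described above, assuming (as I read the code) that IDs are numbered from 0 independently for each converted file. This is plain Python rather than ADAM code; `per_file_ids` and the base observations are made up for the example.

```python
from collections import Counter
from typing import Dict, List


def per_file_ids(read_groups: List[str]) -> Dict[str, int]:
    """Hypothetical per-file numbering: IDs restart at 0 for every file."""
    return {name: i for i, name in enumerate(read_groups)}


# Two files converted separately, each holding a single read group.
ids_file_1 = per_file_ids(["A"])
ids_file_2 = per_file_ids(["B"])

# (read group name, reported quality, is_mismatch) for a few made-up bases.
bases_file_1 = [("A", 30, False), ("A", 30, False)]
bases_file_2 = [("B", 30, True), ("B", 30, True)]

# Tally observations and mismatches per (read group ID, quality) bin.
observations: Counter = Counter()
mismatches: Counter = Counter()
for ids, bases in ((ids_file_1, bases_file_1), (ids_file_2, bases_file_2)):
    for name, qual, is_mismatch in bases:
        key = (ids[name], qual)
        observations[key] += 1
        mismatches[key] += int(is_mismatch)

print(dict(observations))  # {(0, 30): 4}
print(dict(mismatches))    # {(0, 30): 2}
```

Both read groups land in the single bin `(0, 30)`, so A (no mismatches) and B (all mismatches) would be recalibrated from the same pooled error rate, and nothing raises an error along the way.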
That's the main problem, I think. Longer term, I suspect that there will be people who will want to run BQSR or some updated version of it across multiple samples -- but again, you get the same problem if they're converted separately using the current code.
This issue is more of a reminder that we should think about how to address this problem in the future. It can be worked around by pre-merging the BAMs before conversion, but we should be clear that you need to do that. Otherwise, I assume you won't see an error, just a subtle mis-correction of base quality scores.
Please correct me if I'm wrong, though...
Closing as won't fix. The current BQSR codebase does support BQSR over a union of RDDs, which is generally sufficient for this issue. Separately, we don't recommend running multi-sample BQSR. It's not that we recommend against it, but rather that there's no benefit to running multi-sample BQSR instead of single-sample BQSR, due to how the recalibration model is trained.