Add DEDUPLICATE_RECORDS option to RevertSam. #1029
6869acf to 175984e
@@ -164,6 +167,10 @@
"the program will exit with an Exception instead of exiting cleanly. Output BAM will still be valid.")
public double MAX_DISCARD_FRACTION = 0.01;

@Argument(doc = "If SANITIZE=true discard duplicate records. Duplicate records will have the same values for all fields " +
"including tags.")
I think you probably need to add a note here that "deduplicate" doesn't have anything to do with the usual "is duplicate flag" definition of the word. Maybe we can find a better word to use instead?
// Remove records that have the same SAMString (*** SLOW ***)
if (DEDUPLICATE_RECORDS) {
final Iterator<SAMRecord> iter = recs.iterator();
final Set<String> samStrings = new HashSet<>();
- I'm unclear why this is better than just requiring that there be only a single unpaired/R1/R2 in the collection. That's tested below, so even if they are not identical, it'll still fail if there are, e.g. two R1s in the collection.
- I'd be a little concerned that if the extended attributes are in a different order they might generate different strings while otherwise being identical. Is SAMRecord.equals() not trustworthy for some reason?
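The ordering concern in the second point can be illustrated with a minimal sketch. `ToyRecord`, `toText()` and `canonicalKey()` are hypothetical stand-ins invented for this example; they are not htsjdk's `SAMRecord` or its API, and the sketch makes no claim about how `SAMRecord.equals()` behaves. It only shows that a text form that emits tags in stored order can differ between two records with identical content, while a key built from sorted tags does not.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a SAM record (NOT htsjdk's SAMRecord).
final class ToyRecord {
    final String name;
    final String bases;
    final Map<String, String> tags; // keeps insertion order, like tags as written in a file

    ToyRecord(String name, String bases, Map<String, String> tags) {
        this.name = name;
        this.bases = bases;
        this.tags = new LinkedHashMap<>(tags);
    }

    // Order-sensitive text form: tags are emitted in the order they were stored.
    String toText() {
        return render(tags);
    }

    // Order-insensitive key: tags are sorted by tag name before rendering.
    String canonicalKey() {
        return render(new TreeMap<>(tags));
    }

    private String render(Map<String, String> orderedTags) {
        StringBuilder sb = new StringBuilder(name).append('\t').append(bases);
        for (Map.Entry<String, String> e : orderedTags.entrySet()) {
            sb.append('\t').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Under these assumptions, a `HashSet<String>` keyed on `toText()` would treat two records whose tags are `RG, NM` and `NM, RG` as distinct, whereas one keyed on `canonicalKey()` would treat them as the same.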
2af221b to a7bfa5d
@tfenne thanks for the quick review. I implemented your suggestion to keep only the first read for R1/R2/unpaired respectively. I improved the error message a little too.
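A rough sketch of the "keep only the first read for R1/R2/unpaired" idea follows. This is not the code from this PR: `ToyRead`, `KeepFirst` and `keepFirstPerCategory` are hypothetical names, and the record type is a simplified stand-in rather than htsjdk's `SAMRecord`. It assumes the input list holds all reads sharing one read name, in the order they were encountered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified read (NOT htsjdk's SAMRecord).
final class ToyRead {
    final String name;
    final boolean paired;      // true if the read is part of a pair
    final boolean firstOfPair; // only meaningful when paired is true
    final String label;        // illustrative field to tell copies apart

    ToyRead(String name, boolean paired, boolean firstOfPair, String label) {
        this.name = name;
        this.paired = paired;
        this.firstOfPair = firstOfPair;
        this.label = label;
    }
}

final class KeepFirst {
    // Keeps the first unpaired read, the first R1 and the first R2; drops later copies.
    static List<ToyRead> keepFirstPerCategory(List<ToyRead> recs) {
        ToyRead unpaired = null;
        ToyRead first = null;
        ToyRead second = null;
        for (ToyRead r : recs) {
            if (!r.paired) {
                if (unpaired == null) unpaired = r;
            } else if (r.firstOfPair) {
                if (first == null) first = r;
            } else {
                if (second == null) second = r;
            }
        }
        List<ToyRead> kept = new ArrayList<>();
        if (unpaired != null) kept.add(unpaired);
        if (first != null) kept.add(first);
        if (second != null) kept.add(second);
        return kept;
    }
}
```

Because the selection depends only on the pairing category and encounter order, it sidesteps the question of whether two copies compare as identical strings.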
9fb5fa9 to 93ac7b4
Looks good to me @nh13. Thanks!
discarded += recs.size() - 2;
recs = Arrays.asList(firstRecord, secondRecord);
} else {
log.debug("Discarding " + recs.size() + " reads with name " + recs.get(0).getReadName() + " because we found " + firsts + " R1s " + seconds + " R2s and " + unpaired + " unpaired reads.");
I would just use commas here instead of + so that all the string concatenation is only done if the log level is debug.
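The difference between the two call styles can be sketched with a toy logger. `ToyLog` and `CountingName` are hypothetical classes written for this illustration, assuming a logger whose `debug` method takes varargs and joins them only when debug output is enabled; this is not the project's actual logging class.

```java
// Hypothetical varargs logger (an assumption for illustration, not the real logging class).
final class ToyLog {
    private final boolean debugEnabled;
    private final StringBuilder sink = new StringBuilder();

    ToyLog(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    // Parts are converted to strings and joined only if debug is enabled.
    void debug(Object... parts) {
        if (!debugEnabled) return;
        for (Object p : parts) sink.append(p);
        sink.append('\n');
    }

    String output() {
        return sink.toString();
    }
}

// Counts how often it is converted to a string, to make the cost visible.
final class CountingName {
    int toStringCalls = 0;

    @Override
    public String toString() {
        toStringCalls++;
        return "read1";
    }
}
```

With `new ToyLog(false)`, calling `log.debug("Discarding ", name)` never invokes `name.toString()`, while `log.debug("Discarding " + name)` builds the full string before the logger can check the level. The arguments themselves are still evaluated in both styles; only the string conversion and concatenation are deferred.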
In some cases, the same record may be found multiple times in a BAM file. This option allows the user to keep only the first record encountered for R1, R2, and unpaired reads respectively.
93ac7b4 to 7f4832b
Description
In some cases, the same record may be found multiple times in a BAM
file. This option allows the user to remove the duplicate record,
whereas the default behavior is to discard the records (and mates if
present).