Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DEDUPLICATE_RECORDS option to RevertSam. #1029

Merged
merged 1 commit into from Dec 14, 2017

Conversation

nh13
Copy link
Collaborator

@nh13 nh13 commented Dec 13, 2017

Description

In some cases, the same record may be found multiple times in a BAM
file. This option allows the user to remove the duplicate record,
whereas the default behavior is to discard the records (and mates if
present).

Checklist

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

@nh13 nh13 force-pushed the nh_revert_sam_deduplicate_records branch from 6869acf to 175984e Compare December 13, 2017 18:09
@coveralls
Copy link

coveralls commented Dec 13, 2017

Coverage Status

Coverage increased (+0.05%) to 77.265% when pulling 175984e on nh_revert_sam_deduplicate_records into 2bb2714 on master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) to 77.255% when pulling 175984e on nh_revert_sam_deduplicate_records into 2bb2714 on master.

@@ -164,6 +167,10 @@
"the program will exit with an Exception instead of exiting cleanly. Output BAM will still be valid.")
public double MAX_DISCARD_FRACTION = 0.01;

@Argument(doc = "If SANITIZE=true discard duplicate records. Duplicate records will have the same values for all field" +
"including tags.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you probably need to add a note here that "deduplicate" doesn't have anything do with the usual "is duplicate flag" definition of the word. Maybe we can find a better word to use instead?

// Remove records that have the SAM SAMString (*** SLOW ***)
if (DEDUPLICATE_RECORDS) {
final Iterator<SAMRecord> iter = recs.iterator();
final Set<String> samStrings = new HashSet<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm unclear why this is better than just requiring that there be only a single unpaired/R1/R2 in the collection. That's tested below, so even if they are not identical, it'll still fail if there are, e.g. two R1s in the collection.
  2. I'd be a little concerned that if the extended attributes are in a different order they might generate different strings while otherwise being identical. Is SAMRecord.equals() not trustworthy for some reason?

@nh13 nh13 force-pushed the nh_revert_sam_deduplicate_records branch from 2af221b to a7bfa5d Compare December 13, 2017 21:49
@nh13
Copy link
Collaborator Author

nh13 commented Dec 13, 2017

@tfenne thanks for the quick review. I implemented your suggestion to keep only the first read for R1/R2/unpaired respectively. I improved the error message a little too.

@nh13 nh13 force-pushed the nh_revert_sam_deduplicate_records branch 2 times, most recently from 9fb5fa9 to 93ac7b4 Compare December 13, 2017 21:51
@coveralls
Copy link

coveralls commented Dec 13, 2017

Coverage Status

Coverage increased (+0.03%) to 77.24% when pulling 93ac7b4 on nh_revert_sam_deduplicate_records into 2bb2714 on master.

@coveralls
Copy link

coveralls commented Dec 13, 2017

Coverage Status

Coverage increased (+0.03%) to 77.24% when pulling 93ac7b4 on nh_revert_sam_deduplicate_records into 2bb2714 on master.

@coveralls
Copy link

coveralls commented Dec 13, 2017

Coverage Status

Coverage increased (+0.03%) to 77.24% when pulling 93ac7b4 on nh_revert_sam_deduplicate_records into 2bb2714 on master.

@coveralls
Copy link

coveralls commented Dec 13, 2017

Coverage Status

Coverage increased (+0.03%) to 77.24% when pulling 93ac7b4 on nh_revert_sam_deduplicate_records into 2bb2714 on master.

Copy link
Collaborator

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @nh13. Thanks!

discarded += recs.size() - 2;
recs = Arrays.asList(firstRecord, secondRecord);
} else {
log.debug("Discarding " + recs.size() + " reads with name " + recs.get(0).getReadName() + " because we found " + firsts + " R1s " + seconds + " R2s and " + unpaired + " unpaired reads.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just use commas here instead of + so that all the string concatenation is only done if the log level is debug.

In some cases, the same record may be found multiple times in a BAM
file.  This option allows the user to to keep only the first record
encountered for R1, R2, and unpaired reads respectively.
@nh13 nh13 force-pushed the nh_revert_sam_deduplicate_records branch from 93ac7b4 to 7f4832b Compare December 14, 2017 18:23
@coveralls
Copy link

coveralls commented Dec 14, 2017

Coverage Status

Coverage increased (+0.03%) to 77.24% when pulling 7f4832b on nh_revert_sam_deduplicate_records into 2bb2714 on master.

@nh13 nh13 merged commit 87b58c7 into master Dec 14, 2017
@nh13 nh13 deleted the nh_revert_sam_deduplicate_records branch December 14, 2017 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants