[Bug] When removing duplicates, MarkDuplicates does not remove unmapped records whose mate is a duplicate #451
To follow up, the resulting file does not validate.
This is, in my opinion, a minor bug, given that my recent tutorial on MarkDuplicates recommends keeping all reads and not removing duplicates, consistent with a lossless operating procedure.
I think that this is no longer the case when the input file is queryname sorted. @sooheelee can you please check to see if that is good enough?
When the mapped mate of a singly mapping pair is marked as duplicate, its unmapped mate is not marked as duplicate. This is expected behavior, as MarkDuplicates excludes unmapped reads from consideration.
If, however, I add REMOVE_DUPLICATES=true to the MarkDuplicates command, the singly mapped duplicates are removed but their unmapped mates remain in the resulting file. This is undesirable behavior; these unmapped mates should also be removed.
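The requested behavior can be illustrated with a minimal sketch, assuming each record is reduced to a (read name, SAM flag) tuple. This is not how MarkDuplicates is implemented; the function name `remove_duplicates_with_mates` and the toy records are hypothetical. The sketch drops every record flagged as duplicate and, in addition, any unmapped record that shares its read name with a duplicate.

```python
# SAM flag bits, as defined in the SAM specification.
READ_UNMAPPED = 0x4
DUPLICATE = 0x400


def remove_duplicates_with_mates(records):
    """Drop duplicate records and the unmapped mates of duplicate records.

    `records` is an iterable of (read_name, flag) tuples. This simplified
    input shape is an assumption made for illustration only.
    """
    records = list(records)
    # Read names that have at least one record flagged as duplicate.
    duplicate_names = {name for name, flag in records if flag & DUPLICATE}
    kept = []
    for name, flag in records:
        if flag & DUPLICATE:
            continue  # the duplicate record itself
        if flag & READ_UNMAPPED and name in duplicate_names:
            continue  # unmapped mate of a duplicate record
        kept.append((name, flag))
    return kept


# Hypothetical toy records.
records = [
    ("pairA", 1097),  # paired, mate unmapped, first in pair, duplicate
    ("pairA", 133),   # paired, read unmapped, second in pair
    ("pairB", 73),    # paired, mate unmapped, first in pair (not duplicate)
    ("pairB", 133),   # paired, read unmapped, second in pair
    ("pairC", 99),    # properly paired, first in pair
    ("pairC", 147),   # properly paired, second in pair
]
filtered = remove_duplicates_with_mates(records)
```

In this toy input both pairA records are dropped, while the unmapped pairB record is kept because its mapped mate is not a duplicate.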
In 6747_snippet_piped_markduplicates.bam, in which duplicates are marked according to Tutorial#6747 using MarkDuplicates, I have 255 singly mapping reads that are also duplicate (SAM flag 1033) and no unmapped reads marked as duplicate (SAM flag 1029).
Using the following command, I get a list of read names for duplicate singly mapping records:
Let's use the last record, H0164ALXX140820:2:2224:31825:66180, as the case in point. This is what the record looks like in 6747_snippet_piped_markduplicates.bam:
Here's the example command to remove the duplicates and results from grepping for two of the records. As you can see, the unmapped mate of the duplicate singly mapped read remains.
In fact, doing a simple count shows a difference of 255 reads in the mate-unmapped category but no difference in the unmapped reads category:
This is undesirable behavior, or at least inconsistent behavior. This bug report is from my personal testing.
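For readers less familiar with the SAM flag values used in this report, the sketch below decodes them into their component bits. The bit meanings come from the SAM specification; the helper names `decode_flag` and `has_all_bits` are hypothetical, and only the four bits relevant here are listed.

```python
# Subset of SAM flag bits relevant to this report (from the SAM specification).
FLAG_BITS = {
    0x1: "paired",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x400: "duplicate",
}


def decode_flag(flag):
    """Return the labels of the listed bits that are set in `flag`."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def has_all_bits(flag, mask):
    """True if every bit in `mask` is set in `flag` (a require-all-bits filter)."""
    return flag & mask == mask


singly_mapped_duplicate = decode_flag(1033)  # 1 + 8 + 1024
unmapped_duplicate = decode_flag(1029)       # 1 + 4 + 1024
```

A record with flag 1097 (1033 plus the first-in-pair bit, 64) satisfies `has_all_bits(1097, 1033)`, whereas an unmapped mate with flag 133 does not satisfy `has_all_bits(133, 1029)` because its duplicate bit is not set.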