[Bug] When removing duplicates, MarkDuplicates does not remove unmapped records whose mate is a duplicate #451

sooheelee · 2016-02-18T16:20:10Z

When the mapped mate of a singly mapping pair is marked as duplicate, its unmapped mate is not marked as duplicate. This is expected behavior as MarkDuplicates ignores unmapped reads from consideration.

If, however, I add 'REMOVE_DUPLICATES'=true to the MarkDuplicates command, the singly mapped duplicates are removed but the unmapped mates remain in the resulting file. This is undesirable behavior and these unmapped mates should also be removed.

This bug applies to MarkDuplicates and likely MarkDuplicatesWithMateCigar as well.
This bug pertains to singly mapping pairs as defined in this dictionary entry. I tested this and Tutorial#6747 explains its example data that you can use to recapitulate the bug.
This bug could also potentially apply to secondary records whose primary mappings are marked as duplicate. I did not test this scenerio but recommend this be addressed as well.
This bug could also potentially apply to any tool that considers and removes singly mapping reads.

In 6747_snippet_piped_markduplicates.bam, in which duplicates are marked according to Tutorial#6747 using MarkDuplicates, I have 255 singly mapping reads that are also duplicate (SAM flag 1033) and no unmapped reads marked as duplicate (SAM flag 1029).

-bash-3.2$ samtools view -c -f 1033 6747_snippet_piped_markduplicates.bam
255
-bash-3.2$ samtools view -c -f 1029 6747_snippet_piped_markduplicates.bam
0

Using the following command, I get a list of read names for duplicate singly mapping records:

-bash-3.2$ samtools view -f 1033 6747_snippet_piped_markduplicates.bam | cut -f1 | sort | uniq
H0164ALXX140820:2:1101:1944:34606
H0164ALXX140820:2:1101:20356:2293
. [truncated]
H0164ALXX140820:2:2224:31389:66479
H0164ALXX140820:2:2224:31825:66180

Let's use the last record H0164ALXX140820:2:2224:31825:66180 as the case in point. This is what the record looks like in 6747_snippet_piped_markduplicates.bam:

H0164ALXX140820:2:2224:31825:66180  1113    10  96663263    60  72S79M  =   96663263    0   TGTTGTCGTTGCGTGTTTGGGGGGTGGGTTGGTTTTGCCGTTAGGTAGGGTGAGTGGGTCGGAAGTGCTGACAAACAGCAAGAGTAATATAGTTAATACACTCCCCTGAAAACCTGCTAAAAGAGTAGATTTTAGGTGCTTATACCACGAA #############################################################################AAJ-FAJFA<7F7JA-JA---J-J<-JJ-JFF-A-FJJ-JAFJF--FJF<FFJJAAFFJFJJJJJJFJJFFAFA MD:Z:30A48  PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:1  UQ:i:12 AS:i:74 XS:i:36
H0164ALXX140820:2:2224:31825:66180  165 10  96663263    0   *   =   96663263    0   ACCACCAAACCAAAAAACGACACAACCCCCCCCACACCAACACCAAAAAAACAACCACCAAACAAAACCAACACACGAAGACACGACAAGACAGGGCAGTAAAGACAGCAAGTGTAAAGCAACGGAACAGCACCAGACAGAGTAGTGAAGT AAA---<<7-AF---<F7---<---JA-F-F<<-FF<<<F############################################################################################################### MC:Z:72S79M PG:Z:MarkDuplicates RG:Z:H0164.2

Here's the example command to remove the duplicates and results from grepping for two of the records. As you can see, the unmapped mate of the duplicate singly mapped read remains.

-bash-3.2$ java -jar $PICARD MarkDuplicates INPUT=6483_snippet_piped.bam OUTPUT=snippet_removeduplicates_markduplicates.bam METRICS_FILE=snippet_removeduplicates_markduplicates_metrics.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true CREATE_INDEX=true
-bash-3.2$ samtools view snippet_removeduplicates_markduplicates.bam | grep 'H0164ALXX140820:2:2224:31825:66180'
H0164ALXX140820:2:2224:31825:66180  165 10  96663263    0   *   =   96663263    0   ACCACCAAACCAAAAAACGACACAACCCCCCCCACACCAACACCAAAAAAACAACCACCAAACAAAACCAACACACGAAGACACGACAAGACAGGGCAGTAAAGACAGCAAGTGTAAAGCAACGGAACAGCACCAGACAGAGTAGTGAAGT AAA---<<7-AF---<F7---<---JA-F-F<<-FF<<<F############################################################################################################### MC:Z:72S79M PG:Z:MarkDuplicates RG:Z:H0164.2
-bash-3.2$ samtools view snippet_removeduplicates_markduplicates.bam | grep 'H0164ALXX140820:2:2224:31389:66479'
H0164ALXX140820:2:2224:31389:66479  165 10  96663251    0   *   =   96663251    0   ATCTTGAAACTTAACAAGAAGATACTGCCGTTTGCATCAGAATGGAATGAAAATGGAGGATCATATGATAAGTGAAATAACCACGACAATGATGAAAACGCCTCGAACCATAACAACGCAGAAACGAATCATAAAAAGAATAAACAAGAGG <--<--AF<--AFA-J7J-J-A<A------F-AF--A-A--7-J-7-<-FA-<--F-J-7----A-<A-7<A-JA--A-<<-A-F################################################################## MC:Z:60S91M PG:Z:MarkDuplicates RG:Z:H0164.2

In fact, doing a simple count shows a difference of 255 reads in the mate-unmapped category but no difference in unmapped reads category:

-bash-3.2$ samtools view -c -f 4 snippet_removeduplicates_markduplicates.bam
1211
-bash-3.2$ samtools view -c -f 8 snippet_removeduplicates_markduplicates.bam
2380
-bash-3.2$ samtools view -c -f 4 6483_snippet_piped.bam
1211
-bash-3.2$ samtools view -c -f 8 6483_snippet_piped.bam
2635

This is undesirable behavior, or at least inconsistent behavior. This bug report is from my personal testing.

The text was updated successfully, but these errors were encountered:

sooheelee · 2016-02-18T16:42:03Z

To follow up, the resulting file does not validate.

-bash-3.2$ java -jar $PICARD ValidateSamFile INPUT=snippet_removeduplicates_markduplicates.bam MODE=SUMMARY
[Thu Feb 18 11:40:32 EST 2016] picard.sam.ValidateSamFile INPUT=snippet_removeduplicates_markduplicates.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Feb 18 11:40:32 EST 2016] Executing as shlee@gsa3.broadinstitute.org on Linux 2.6.18-371.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132; Picard version: 1.1010(a4d54b4cc728030c92afe147fb013f1f2a9a0a39_1455717603) IntelDeflater


## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:MATE_NOT_FOUND    255

[Thu Feb 18 11:40:39 EST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.11 minutes.
Runtime.totalMemory()=2352480256
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

sooheelee · 2016-02-21T17:30:27Z

This is in my opinion a minor bug given that my recent tutorial on MarkDuplicates recommends keeping all reads and not removing duplicates consistent with a lossless operating procedure.

yfarjoun · 2018-05-25T20:01:09Z

I think that this is no longer the case when the input file is queryname sorted. @sooheelee can you please check to see if that is good enough?

sooheelee added the bug label Feb 18, 2016

NataliaLeznikova mentioned this issue Dec 7, 2017

Fix secondary, supplementary and unmapped duplicate reads removing #1018

Closed

yfarjoun closed this as completed Mar 15, 2019

AlijahArcher mentioned this issue Nov 20, 2020

MarkDuplicates should mark unmapped reads as duplicates when their mapped mates are marked. #1138

Open

2 tasks

mathiasbio mentioned this issue Feb 22, 2023

Optimise mergeBam rule Clinical-Genomics/BALSAMIC#1053

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] When removing duplicates, MarkDuplicates does not remove unmapped records whose mate is a duplicate #451

[Bug] When removing duplicates, MarkDuplicates does not remove unmapped records whose mate is a duplicate #451

sooheelee commented Feb 18, 2016

sooheelee commented Feb 18, 2016

sooheelee commented Feb 21, 2016

yfarjoun commented May 25, 2018

[Bug] When removing duplicates, MarkDuplicates does not remove unmapped records whose mate is a duplicate #451

[Bug] When removing duplicates, MarkDuplicates does not remove unmapped records whose mate is a duplicate #451

Comments

sooheelee commented Feb 18, 2016

sooheelee commented Feb 18, 2016

sooheelee commented Feb 21, 2016

yfarjoun commented May 25, 2018