
MarkDuplicatesWithMateCigar errors on read ordering w/ sorted input. #327

Closed

MurphyMarkW opened this issue Nov 5, 2015 · 8 comments

@MurphyMarkW

My group ran across this while running some TCGA datasets and I was hoping someone might have some deeper insight into what's going on. The files we're working with are (very) large and non-public. Because of that, I boiled down the inputs while maintaining the error so that I could share them.

It appears that Picard MarkDuplicatesWithMateCigar will complain about reads being out of order when run with certain options on certain coordinate-sorted inputs. The input SAM passes strict ValidateSamFile without error, and will actually pass through MarkDuplicatesWithMateCigar perfectly fine depending on the settings used in the command.

The input I'm currently using to produce the error is:

@HD VN:1.5  SO:coordinate
@SQ SN:chr1 LN:248956422
@RG ID:0    SM:0    PL:ILLUMINA
QNAME0  67  chr1    1   0   101M    =   2   102 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
QNAME0  131 chr1    2   0   101M    =   1   -102    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
QNAME1  115 chr1    115 2   1S100M  =   116 102 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
QNAME1  179 chr1    116 2   101M    =   115 -102    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:1S100M RG:Z:0  NM:i:0

Strict ValidateSamFile for good measure:

java -jar picard-tools-1.140/picard.jar ValidateSamFile INPUT=sorting_error.sam VALIDATION_STRINGENCY=STRICT VERBOSITY=DEBUG
[Thu Nov 05 19:42:30 UTC 2015] picard.sam.ValidateSamFile INPUT=sorting_error.sam VERBOSITY=DEBUG VALIDATION_STRINGENCY=STRICT    MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Nov 05 19:42:30 UTC 2015] Executing as ubuntu@mwm-thelab on Linux 3.13.0-62-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
DEBUG   2015-11-05 19:42:30 Option  Ignoring VALIDATE_CRC_CHECKSUMS option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
DEBUG   2015-11-05 19:42:30 Option  Ignoring VALIDATE_CRC_CHECKSUMS option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
No errors found
[Thu Nov 05 19:42:30 UTC 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=252182528

And the command that will produce the error is:

java -jar picard-tools-1.140/picard.jar MarkDuplicatesWithMateCigar INPUT=sorting_error.sam OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=. MAX_RECORDS_IN_RAM=2 MINIMUM_DISTANCE=212 VALIDATION_STRINGENCY=STRICT VERBOSITY=DEBUG
[Thu Nov 05 19:39:19 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar MINIMUM_DISTANCE=212 INPUT=[sorting_error.sam] OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=[.] VERBOSITY=DEBUG VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=2    SKIP_PAIRS_WITH_NO_MATE_CIGAR=true BLOCK_SIZE=100000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicatesWithMateCigar REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Nov 05 19:39:19 UTC 2015] Executing as ubuntu@mwm-thelab on Linux 3.13.0-62-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
DEBUG   2015-11-05 19:39:19 Option  Ignoring EAGERLY_DECODE option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
WARNING 2015-11-05 19:39:19 AbstractOpticalDuplicateFinderCommandLineProgram    Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'QNAME0'.  You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates.  Note that this message will not be emitted again even if other read names do not match the regex.
WARNING 2015-11-05 19:39:19 MarkDuplicatesWithMateCigar Encountered a record with no program record, program group chaining will not occur for this read: QNAME0 1/2 101b aligned read.
Previous record: QNAME1 179 chr1    116 2   101M    =   115 -102    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:1S100M RG:Z:0  NM:i:0
Current record:QNAME1   115 chr1    115 2   1S100M  =   116 102 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
[Thu Nov 05 19:39:19 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=252182528
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Records were not found coordinate sort order
    at picard.sam.markduplicates.MarkDuplicatesWithMateCigarIterator.next(MarkDuplicatesWithMateCigarIterator.java:228)
    at picard.sam.markduplicates.MarkDuplicatesWithMateCigarIterator.next(MarkDuplicatesWithMateCigarIterator.java:47)
    at picard.sam.markduplicates.MarkDuplicatesWithMateCigar.doWork(MarkDuplicatesWithMateCigar.java:132)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Note the MAX_RECORDS_IN_RAM=2 and MINIMUM_DISTANCE=212. We came across this error by chance on a much larger file, using MAX_RECORDS_IN_RAM=50000 with the same MINIMUM_DISTANCE. Both options seem to be required to trigger the error, but the exact values that trigger it depend on the content of the records. There may be other values where the error occurs, but small deviations around these numbers allow the file to pass through without error. For example, with MAX_RECORDS_IN_RAM=3, or any higher value for that matter, we get the following:

java -jar picard-tools-1.140/picard.jar MarkDuplicatesWithMateCigar INPUT=sorting_error.sam OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=. MAX_RECORDS_IN_RAM=3 MINIMUM_DISTANCE=212 VALIDATION_STRINGENCY=STRICT VERBOSITY=DEBUG
[Thu Nov 05 19:47:03 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar MINIMUM_DISTANCE=212 INPUT=[sorting_error.sam] OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=[.] VERBOSITY=DEBUG VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=3    SKIP_PAIRS_WITH_NO_MATE_CIGAR=true BLOCK_SIZE=100000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicatesWithMateCigar REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Nov 05 19:47:03 UTC 2015] Executing as ubuntu@mwm-thelab on Linux 3.13.0-62-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
DEBUG   2015-11-05 19:47:03 Option  Ignoring EAGERLY_DECODE option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
WARNING 2015-11-05 19:47:03 AbstractOpticalDuplicateFinderCommandLineProgram    Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'QNAME0'.  You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates.  Note that this message will not be emitted again even if other read names do not match the regex.
WARNING 2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Encountered a record with no program record, program group chaining will not occur for this read: QNAME0 1/2 101b aligned read.
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Processed 4 records
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Found 0 records with no mate cigar optional tag.
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Marking 0 records as duplicates.
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Found 0 optical duplicate clusters.
[Thu Nov 05 19:47:03 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=252182528

My initial thought was that I was missing some subtle detail of the SAM spec. But given that the input passes ValidateSamFile, and passes through without error under different parameters (albeit with warnings about the QNAME formatting and the lack of PG tags), that seems less likely...

Does this make sense to anyone? Any thoughts or suggestions would be greatly appreciated!

@nh13
Collaborator

nh13 commented Nov 9, 2015

@MurphyMarkW @bradtaylor and I wrote this version of MarkDuplicates, so loop us in as we go along. [EDIT: you posted the SAM]. By the way, those are some ugly alignments and basecalls.

@nh13
Collaborator

nh13 commented Nov 9, 2015

Developer note: here's the bug captured in a test case for DiskBackedQueueTest. The likely culprit: once we have spilled to disk, we cannot keep adding records to RAM.

    /** See: https://github.com/broadinstitute/picard/issues/327 */
    @Test
    public void testPathologyIssue327() {
        final File tmpDir = new File("testdata/htsjdk/samtools/util");
        if (!tmpDir.exists()) tmpDir.mkdir();
        tmpDir.deleteOnExit();

        final DiskBackedQueue<String> queue = DiskBackedQueue.newInstance(
                new StringCodec(),
                2, // maxRecordsInRam
                Arrays.asList(tmpDir)
        );

        // testing a particular order of adding to the queue, setting the result state, and emitting.
        queue.add("0");
        queue.add("1");
        queue.add("2");
        Assert.assertEquals(queue.poll(), "0");
        queue.add("3");
        Assert.assertEquals(queue.poll(), "1");
        Assert.assertEquals(queue.poll(), "2");
        Assert.assertEquals(queue.poll(), "3");
    }
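
The spill-ordering bug that this test tickles is easy to see in a stripped-down sketch. The class below is purely illustrative (it is not the htsjdk code, and its "disk" is just another in-memory deque standing in for a file-backed queue): a naive spill-to-disk FIFO that keeps accepting records into its RAM buffer after it has already spilled will let a late add overtake records that spilled earlier.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: a naive spill queue that loses FIFO order. */
public class NaiveSpillQueue<T> {
    private final int maxRecordsInRam;
    private final Deque<T> ram = new ArrayDeque<>();
    private final Deque<T> disk = new ArrayDeque<>(); // stand-in for a file-backed queue

    public NaiveSpillQueue(final int maxRecordsInRam) {
        this.maxRecordsInRam = maxRecordsInRam;
    }

    public void add(final T record) {
        // BUG: after a record has spilled to "disk", a later add that finds
        // free space in RAM again will jump ahead of the spilled record.
        if (ram.size() < maxRecordsInRam) ram.addLast(record);
        else disk.addLast(record);
    }

    public T poll() {
        // RAM is always drained before disk, so a late RAM add is emitted
        // before records that spilled earlier.
        if (!ram.isEmpty()) return ram.pollFirst();
        return disk.pollFirst();
    }
}
```

With maxRecordsInRam = 2 and the exact add/poll pattern from the test case, this sketch emits 0, 1, 3, 2: record "3" finds room in RAM after "0" is polled and overtakes the spilled "2", the same inversion that surfaces as the coordinate-sort-order exception.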

@nh13
Collaborator

nh13 commented Nov 9, 2015

@MurphyMarkW fixed in samtools/htsjdk#376

@nh13
Collaborator

nh13 commented Nov 9, 2015

@bradtaylor, I said this over email, but I'm putting it here for posterity.

Here's an idea that should let you get rid of "canAdd" in DiskBackedQueue entirely, so you never have to worry about it.

Have four internal buffers:

  1. primaryRamQueue
  2. primaryDiskQueue
  3. secondaryRamQueue
  4. secondaryDiskQueue
  • You put records into #1 if there is space, or into #2 if #1 is full.
  • When you poll from #1 while there are records in #2, new adds go into either #3 or #4 (enforcing the # of records in RAM across #1 & #3).
  • If #1 is empty but #2 is not, poll returns from #2, and add puts records in #3 or #4 (RAM-limited).
  • When #1 and #2 are both empty, swap #1 & #3, and #2 & #4.
  • Enforce that size(#1) + size(#3) stays under the # of records allowed in RAM.

In any case, it is probably a good idea to first think of 3-5 test cases that exercise the switching among the four queues above. Have maxRecordsInRam be something small, like 2. Also, get rid of the "headRecord" thing, as it is confusing.
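
The four-buffer scheme above can be sketched in plain Java. This is a rough sketch under simplifying assumptions, not the htsjdk implementation: in-memory deques stand in for the file-backed queues, and only the routing/swap logic is shown. Adds are routed to the secondary pair as soon as anything has spilled, and the secondary pair is promoted only after the primary pair drains, which preserves FIFO order without any "canAdd" check.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of the four-buffer queue idea; deques stand in for disk-backed queues. */
public class FourBufferQueue<T> {
    private final int maxRecordsInRam;
    private Deque<T> primaryRam = new ArrayDeque<>();    // #1
    private Deque<T> primaryDisk = new ArrayDeque<>();   // #2 (stand-in for disk)
    private Deque<T> secondaryRam = new ArrayDeque<>();  // #3
    private Deque<T> secondaryDisk = new ArrayDeque<>(); // #4 (stand-in for disk)

    public FourBufferQueue(final int maxRecordsInRam) {
        this.maxRecordsInRam = maxRecordsInRam;
    }

    public void add(final T record) {
        final boolean spilled = !primaryDisk.isEmpty()
                || !secondaryRam.isEmpty() || !secondaryDisk.isEmpty();
        if (!spilled) {
            // Nothing has spilled yet: fill #1, overflow into #2.
            if (primaryRam.size() < maxRecordsInRam) primaryRam.addLast(record);
            else primaryDisk.addLast(record);
        } else {
            // Once anything is past #1, new records go to #3 (RAM-limited
            // across #1 & #3) or #4, so they can never overtake #2.
            if (primaryRam.size() + secondaryRam.size() < maxRecordsInRam) {
                secondaryRam.addLast(record);
            } else {
                secondaryDisk.addLast(record);
            }
        }
    }

    public T poll() {
        if (!primaryRam.isEmpty()) return primaryRam.pollFirst();
        if (!primaryDisk.isEmpty()) return primaryDisk.pollFirst();
        if (secondaryRam.isEmpty() && secondaryDisk.isEmpty()) return null; // drained
        // #1 and #2 are both empty: swap #1 & #3, and #2 & #4, then poll again.
        primaryRam = secondaryRam;
        primaryDisk = secondaryDisk;
        secondaryRam = new ArrayDeque<>();
        secondaryDisk = new ArrayDeque<>();
        return poll();
    }
}
```

With maxRecordsInRam = 2, this sketch handles the pathological add/poll pattern from the DiskBackedQueueTest case in FIFO order (0, 1, 2, 3): the late add lands in the secondary RAM queue instead of jumping ahead of the spilled record.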

@MurphyMarkW
Author

@nh13 @bradtaylor \o/ Thank you for taking the time to look into this! I didn't think to look in htsjdk for the source of the bug.

I'd be happy to test out changes on our side and report back. And agreed re: ugly calls @nh13. Doctored records due to protected sequences and all that jazz. :P

@yfarjoun
Contributor

What's the status of this issue? Is this still a bug?

@MurphyMarkW
Author

@yfarjoun it's been a while, but I believe a fix was applied for this specific case and that it worked for our test cases (specifically the one given above). I believe my group ran into a similar-looking issue that we couldn't reliably reproduce, so there may be something else lurking about. I'd have to dig through our logs for a dataset that reproduces it, but if you or someone else has seen something similar, I might set aside some time to look into it.

From my side, I'd say this particular bug is resolved and the issue can be closed. Apologies for not reporting these results more promptly; it kind of fell off the radar.

@yfarjoun
Contributor

No worries, @MurphyMarkW. Just following up to see if we can close this issue. Please reopen or open a new issue if this rears its head again.
