tools that expect unaligned reads shouldn't validate the sequence dictionary #4131

Closed
lbergelson opened this issue Jan 11, 2018 · 2 comments · Fixed by #4308

lbergelson commented Jan 11, 2018

Tools that take an unaligned BAM shouldn't expect the BAM to have contigs that match the reference.

These include BwaAndMarkDuplicatesPipelineSpark, BwaSpark, and ReadsPipelineSpark (when running in alignment mode).


nyl2002 commented Jan 16, 2018

Are the errors below, which I get when starting BwaSpark with spark-submit, part of this?
I set "--disable-sequence-dictionary-validation true", but that doesn't help.

It is very unclear why a BAM is not recognized as a BAM file. I have tried all kinds of ways to make sure that it is a BAM and not a SAM file.
The documentation for BwaSpark also says "BAM/SAM/CRAM file containing reads", so if SAM files are really not supported, that should probably be changed.
...
Even at verbosity DEBUG, the messages are not at all helpful for getting at the problem.
E.g. "Cannot retrieve file pointers within SAM text files."
Is that a general statement about SAM files? Or does it only say that in this specific SAM file (which is actually a BAM file) file pointers cannot be found?
What pointers are meant exactly?
How could this be fixed?

"SamReaderFactory	Unable to detect file format from input URL or stream, assuming SAM format."
Which URL?
Which stream?
Why would this happen? What could be the error?
The SAM/BAM distinction seems very unclear. It would be more helpful if some specific missing aspect (e.g. not queryname sorted) were clearly named as the culprit.
...
00:29 DEBUG: [kryo] Write: SAMFileHeader{VN=1.5, SO=queryname}
...
WARNING	2018-01-16 02:11:25	SamReaderFactory	Unable to detect file format from input URL or stream, assuming SAM format.
...
java.lang.UnsupportedOperationException: Cannot retrieve file pointers within SAM text files.
	at htsjdk.samtools.SAMTextReader.getFilePointerSpanningReads(SAMTextReader.java:185)
...

@lbergelson
Member Author

@nyl2002 What command line are you running when you hit these problems? Are you trying to run BWA-MEM spark on a SAM file or on a BAM file?

I agree that we should change documentation and produce a better error message if it's failing on SAM files.
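
To help tell a SAM from a BAM before launching a tool, here is a minimal sketch that inspects the first bytes of a file. It relies on two facts from the SAM/BAM specification: a BAM is BGZF-compressed (gzip with an extra "BC" subfield), and its decompressed contents start with the magic "BAM\1". The class name BamSniffer and its methods are hypothetical and are not part of htsjdk or GATK, and this is not how SamReaderFactory detects formats. The BGZF check also assumes the "BC" subfield is the first extra subfield, which is an assumption about the writer.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipException;

// Hypothetical helper, not part of htsjdk or GATK.
public final class BamSniffer {
    private BamSniffer() {}

    /** True if the bytes look like the start of a BGZF block (gzip + FEXTRA + "BC" subfield). */
    public static boolean looksLikeBgzf(byte[] head) {
        return head.length >= 14
                && (head[0] & 0xff) == 0x1f   // gzip ID1
                && (head[1] & 0xff) == 0x8b   // gzip ID2
                && head[2] == 8               // compression method: deflate
                && (head[3] & 4) != 0         // FEXTRA flag set
                && head[12] == 'B'            // assumes "BC" is the first extra subfield
                && head[13] == 'C';
    }

    /** True if the stream is gzip-compressed and decompresses to data starting with "BAM\1". */
    public static boolean looksLikeBam(InputStream in) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            byte[] magic = new byte[4];
            int n = 0;
            while (n < 4) {
                int r = gz.read(magic, n, 4 - n);
                if (r < 0) {
                    break; // stream ended before four bytes were available
                }
                n += r;
            }
            return n == 4 && magic[0] == 'B' && magic[1] == 'A'
                    && magic[2] == 'M' && magic[3] == 1;
        } catch (ZipException | EOFException e) {
            return false; // not gzip at all (e.g. a SAM text file) or an empty stream
        }
    }
}
```

Calling BamSniffer.looksLikeBam(Files.newInputStream(path)) on the input would show whether the bytes on disk are really a BAM, independent of the file extension.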

lbergelson added a commit that referenced this issue Jan 30, 2018
Previously, tools that align reads required you to manually disable sequence dictionary validation.
If you didn't, they would fail because the unaligned BAM didn't have the required sequence dictionary.

Extracted a SequenceDictionaryValidationArgumentCollection and provided a method for GATKSparkTools to configure it.
ReadsPipeline couldn't easily make use of this, so instead it overrides the method that does validation.

BwaSpark / BwaAndMarkDuplicatesPipelineSpark now neither require nor allow dictionary validation.
fixes #4131
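
The design described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the actual GATK code: every class and method name here (ValidationArgs, SparkToolSketch, getValidationArgs, shouldValidate, and the three example tools) is hypothetical. It shows a base tool that asks an argument collection whether to validate, an aligner that swaps in a collection that never validates, and a pipeline that overrides the validation decision directly.

```java
// Hypothetical sketch; names do not correspond to real GATK classes.
public final class DictionaryValidationSketch {
    private DictionaryValidationSketch() {}

    /** Stand-in for an argument collection that decides whether validation runs. */
    public interface ValidationArgs {
        boolean performValidation();
    }

    /** Default collection: validation is on unless the user disables it. */
    public static final class StandardValidationArgs implements ValidationArgs {
        public boolean disableValidation = false; // would be bound to a command-line flag
        @Override
        public boolean performValidation() {
            return !disableValidation;
        }
    }

    /** Collection for tools that read unaligned input: exposes no flag and never validates. */
    public static final class NoValidationArgs implements ValidationArgs {
        @Override
        public boolean performValidation() {
            return false;
        }
    }

    public abstract static class SparkToolSketch {
        /** Tools override this to configure which collection they use. */
        protected ValidationArgs getValidationArgs() {
            return new StandardValidationArgs();
        }

        /** Tools that cannot use a collection can override the decision itself. */
        protected boolean shouldValidate() {
            return getValidationArgs().performValidation();
        }

        public final String run() {
            return shouldValidate() ? "validated" : "skipped";
        }
    }

    /** A tool that works on aligned reads keeps the default behavior. */
    public static final class AlignedReadsTool extends SparkToolSketch {}

    /** An aligner takes unaligned reads, so it plugs in the no-validation collection. */
    public static final class AlignerTool extends SparkToolSketch {
        @Override
        protected ValidationArgs getValidationArgs() {
            return new NoValidationArgs();
        }
    }

    /** A pipeline that only sometimes aligns overrides the validation decision. */
    public static final class PipelineTool extends SparkToolSketch {
        private final boolean align;

        public PipelineTool(boolean align) {
            this.align = align;
        }

        @Override
        protected boolean shouldValidate() {
            return !align && super.shouldValidate();
        }
    }
}
```

The point of the split is that an aligner never exposes a validation flag at all, while a tool whose behavior depends on a runtime mode decides at the point where validation would happen.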
lbergelson added a commit that referenced this issue May 9, 2018
lbergelson added a commit that referenced this issue May 14, 2018
lbergelson added a commit that referenced this issue May 17, 2018
lbergelson added a commit that referenced this issue May 21, 2018
* previously, tools that align reads required you to manually disable sequence dictionary validation
    if you didn't, they would fail because the unaligned bam didn't have the required sequence dictionary

*  extracting out a SequenceDictionaryValidationArgumentCollection and providing a method for GATKSparkTools to configure it
    ReadsPipeline couldn't easily make use of this, so instead it overrides the method that does validation

*   BwaSpark / BwaAndMarkDuplicatesPipelineSpark now do not require or allow dictionary validation
*   fixes #4131