MarkDuplicatesSpark error when FASTQ headers contain another @ in string #8134
Comments
@madisonjordan Thanks for reporting. That's certainly unexpected. It looks like that read name doesn't conform to the expectation of the read name elements being separated by Another thing to check. You say "FASTQ headers contain another @". The input to MarkDuplicatesSpark is the bam, though. Can you check that that line actually made it into the bam? I.e., is there a read with the read name "HWI-ST700660_163:1:1101:1243:1870#1@0/1"? In any case it sounds like a bug, but hopefully you can work around it for now.
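One way to run the check suggested above is to scan the text output of `samtools view -h` for the read name. The following is only a minimal sketch: the function name `count_reads_named` is hypothetical, and it assumes the alignments are available as plain SAM text lines.

```python
def count_reads_named(sam_lines, read_name):
    """Count alignment records whose QNAME (first column) equals read_name.

    sam_lines is any iterable of SAM text lines, for example the output of
    `samtools view -h`. Header lines such as @HD, @SQ and @RG are skipped.
    """
    count = 0
    for line in sam_lines:
        if line.startswith("@"):
            # SAM header lines begin with '@'; they are not alignment records.
            continue
        qname = line.rstrip("\n").split("\t", 1)[0]
        if qname == read_name:
            count += 1
    return count


# Tiny illustrative input; the alignment columns are placeholders.
example = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "HWI-ST700660_163:1:1101:1243:1870#1@0/1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
print(count_reads_named(example, "HWI-ST700660_163:1:1101:1243:1870#1@0/1"))
```

A count of zero would suggest the name was altered or dropped during the FASTQ to bam conversion, rather than inside MarkDuplicatesSpark.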
Hi @lbergelson. Thanks for the quick response! I believe we started with a FASTQ file that had the header I listed above: Which we later converted to a bam using samtools that contains this header: My team thinks it might be the Thank you for the suggestions! We will try that out so we can try to avoid detecting and changing each header. We really appreciate it!
We'll have to dig into the root problem. My guess is that it's a dumb bug with splitting strings somewhere, but I don't think I'll be able to get to it until after the holidays. Hopefully a workaround can get you unstuck for now!
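To illustrate why string splitting is a plausible suspect, the sketch below splits the reported read name on `:` and also tests it against the QNAME pattern from the SAM specification, `[!-?A-~]{1,254}`, which excludes `@`. This is not GATK's actual parsing code; the delimiter and the helper name `describe_read_name` are assumptions made purely for illustration.

```python
import re

# QNAME pattern from the SAM specification; note that '@' is not allowed.
SAM_QNAME = re.compile(r"[!-?A-~]{1,254}")


def describe_read_name(name):
    """Summarise how a read name behaves under a naive ':' split."""
    fields = name.split(":")
    return {
        "fields": fields,
        # A parser expecting a numeric final field would trip on a suffix.
        "last_field_is_integer": fields[-1].isdigit(),
        "valid_sam_qname": SAM_QNAME.fullmatch(name) is not None,
    }


print(describe_read_name("HWI-ST700660_163:1:1101:1243:1870#1@0/1"))
print(describe_read_name("HWI-ST700660_163:1:1101:1243:1870"))
```

For the reported name, the final field is `1870#1@0/1` rather than a plain integer, and the name as a whole falls outside the specification's QNAME pattern.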
No worries on the timeline. I definitely appreciate your willingness to look into it and consider a fix. I know how unexpectedly time-consuming it can become even if it seems simple, especially with other tasks and responsibilities. I hope you enjoy your holidays!
Greetings Louis! Hope you and the GATK team had a wonderful winter recess! Just checking in to see if someone will be looking into this sometime soon. I'm asking because I'm running a few different datasets through our workflow, one of them being this same ICGC dataset with the header issue, and ideally would love to process everything the same way 😄.
Bug Report
Affected tool(s) or class(es)
MarkDuplicatesSpark
Affected version(s)
Description
Headers with another @ character fail to create a valid bam using MarkDuplicatesSpark. The bam file is empty. But the header will work when using samtools markdup instead. The following example was found in one of many samples we found in ICGC datasets.
Example header:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1
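To find which samples are affected, something like the following sketch could flag FASTQ headers that contain an `@` beyond the leading record marker. It assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name `headers_with_extra_at` is hypothetical.

```python
def headers_with_extra_at(fastq_lines):
    """Return read names that contain '@' after the leading record marker.

    Assumes four lines per record. Only every fourth line is treated as a
    header, because quality strings may legitimately begin with '@'.
    """
    flagged = []
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue
        header = line.rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"line {index + 1} is not a FASTQ header: {header!r}")
        name = header[1:]  # drop the record marker, keep the rest of the name
        if "@" in name:
            flagged.append(name)
    return flagged


# Placeholder sequence and quality strings, for illustration only.
records = [
    "@HWI-ST700660_163:1:1101:1243:1870#1@0/1\n", "ACGT\n", "+\n", "IIII\n",
    "@plain_read/1\n", "ACGT\n", "+\n", "@III\n",
]
print(headers_with_extra_at(records))
```

Using the record position instead of a search for lines starting with `@` avoids misreading a quality line as a header.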
Log:
(removed some content since it was too long)
Steps to reproduce
Command:
Expected behavior
Finish MarkDuplicatesSpark successfully and output a valid bam file.
Actual behavior
The bam file is empty.