Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MarkDuplicatesSpark error when FASTQ headers contain another @ in string #8134

Open
1 task
madisonjordan opened this issue Dec 19, 2022 · 5 comments
Open
1 task

Comments

@madisonjordan
Copy link

Bug Report

Affected tool(s) or class(es)

MarkDuplicatesSpark

Affected version(s)

  • GATK version 4.1.9.0

Description

Headers with another @ character fail to create a valid bam using MarkDuplicatesSpark. The bam file is empty. But the header will work when using samtools markdup instead. The following example was found in one of many samples we found in ICGC datasets.

Example header:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1

Log:
(removed some content since it was too long)

00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Write: Object[]
00:05 DEBUG: [kryo] Write: Object[]
00:05 DEBUG: [kryo] Write: Object[]
...
01:22 DEBUG: [kryo] Read: CompressedMapStatus
01:22 DEBUG: [kryo] Write: CompressedMapStatus
...
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Write: WrappedArray([])
02:25 DEBUG: [kryo] Read: scala.Tuple3[]
02:25 DEBUG: [kryo] Read: scala.Tuple3[]
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Write: scala.Tuple3[]
...
02:42 DEBUG: [kryo] Write object reference 1941: HLA-A*24:152
02:42 DEBUG: [kryo] Write object reference 1945: chrUn_JTFH01001224v1_decoy
02:42 DEBUG: [kryo] Write object reference 1949: HLA-B*14:01:01
02:42 DEBUG: [kryo] Write object reference 1953: chr5_GL949742v1_alt
...
02:42 DEBUG: [kryo] Write object reference 1942: SAMSequenceRecord(name=HLA-A*24:152,length=3176,dict_index=2919,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1946: SAMSequenceRecord(name=chrUn_JTFH01001224v1_decoy,length=1051,dict_index=2066,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1950: SAMSequenceRecord(name=HLA-B*14:01:01,length=3312,dict_index=2999,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1954: SAMSequenceRecord(name=chr5_GL949742v1_alt,length=226852,dict_index=241,assembly=null,alternate_names=[])
...
02:42 DEBUG: [kryo] Write: Array[java.lang.Object]
02:42 DEBUG: [kryo] Write: Object[]
02:42 DEBUG: [kryo] Write: byte[]
02:42 DEBUG: [kryo] Write: WrappedArray([])
...
03:20 DEBUG: [kryo] Read: CompressedMapStatus
03:20 DEBUG: [kryo] Write: Array[java.lang.Object]
03:20 DEBUG: [kryo] Write: Object[]
03:20 DEBUG: [kryo] Write: byte[]
03:20 DEBUG: [kryo] Read: Array[java.lang.Object]
03:20 DEBUG: [kryo] Read: Object[]
03:21 DEBUG: [kryo] Write: TaskCommitMessage
03:21 DEBUG: [kryo] Read: TaskCommitMessage

Steps to reproduce

Command:

gatk --java-options "-Djava.io.tmpdir=/scratch" MarkDuplicatesSpark --input C19CUACXX.1.1-1.sorted.bam --output /scratch/C19CUACXX.1.1.sorted.Spark.Strigency-strict.bam --conf 'spark.local.dir=/scratch' --tmp-dir /scratch --read-validation-stringency STRICT

Expected behavior

Finish MarkDuplicatesSpark successfully and output a valid bam file.

Actual behavior

The bam file is empty.

@lbergelson
Copy link
Member

@madisonjordan Thanks for reporting. That's certainly unexpected.

It looks like that read name doesn't conform to the expectation of the read name elements being separated by :. I don't know for sure what's going on but I suspect it's failing to parse the read names and falling over in a very bad way. (I'm shocked we haven't seen this before since unexpected readnames are not exactly rare, so it could be something else entirely.)
By default it assumes readnames are structured like this: "MYNAME:1:22:1193:123181". There is an option --READ_NAME_REGEX to provide a custom regex to parse with which might help in your case. You might also try setting that to null to disable optical duplicate marking.

Another thing to check. You say "FASTQ headers contain another @". The input to SparkDuplicates is the bam though. C you check that that line actually made it into the bam?

I.e. is there a read with the readname "HWI-ST700660_163:1:1101:1243:1870#1@0/1"?

In any case it sounds like a bug, but hopefully you can work around it for now.

@madisonjordan
Copy link
Author

Hi @lbergelson. Thanks for the quick response!

I believe we started with a FASTQ file that had the header I listed above:
HWI-ST700660_163:1:1101:1243:1870#1@0/1

Which we later converted to a bam using samtools that contains this header:
HWI-ST700660_163:1:1101:1243:1870#1@0
after alignment with BWA-MEM2.

My team thinks it might be the @ from the FASTQ header since modifying it to use a format without the additional @, such as @SRR5456220.1 1 length=101, seemed to allow it to work properly in our workflow.

Thank you for the suggestions! We will try that out so we can try to avoid detecting and changing each header. We really appreciate it!

@lbergelson
Copy link
Member

We'll have to dig into the root problem, it's probably a dumb bug with spliting strings somewhere is my guess but I don't think I'll be able to get to it until after the holidays. Hopefully a workaround can get you unstuck for now!

@madisonjordan
Copy link
Author

No worries on the timeline. I definitely appreciate your willingness to look into it and considering a fix. I know how unexpectedly time consuming it can become even if it seems simple, especially with other tasks and responsibilities.

I hope you enjoy your holidays!

@Alfredo-Enrique
Copy link

Greetings Louis! Hope you and the GATK team had a wonderful winter recess!

Just checking-in if someone will be looking into this sometime soon? Asking since I'm running a few different datasets through our workflow, one of them being this same ICGC dataset with the header issue, and ideally would love to process everything the same way 😄.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants