Invalid and mismatch CIGAR after remapping BAM file #952

Closed
J-Moravec opened this issue Jun 24, 2020 · 6 comments

Comments

@J-Moravec

I got plenty of these errors after using samtools view as the --readFilesCommand option.

Sample errors from gatk ValidateSamFile:

ERROR::MISMATCH_CIGAR_SEQ_LENGTH:Record 639025, Read name NS500269:554:HHJGLBGX5:1:21101:2511:3916, CIGAR covers 79983780 bases but the sequence is 56 read bases
ERROR::MISMATCH_CIGAR_SEQ_LENGTH:Record 3772313, Read name NS500269:554:HHJGLBGX5:1:12103:9897:14521, CIGAR covers 70600005 bases but the sequence is 56 read bases
ERROR::INVALID_CIGAR:Record 3772313, Read name NS500269:554:HHJGLBGX5:1:12103:9897:14521, No real operator (M|I|D|N) in CIGAR
ERROR::MISMATCH_CIGAR_SEQ_LENGTH:Record 3772314, Read name NS500269:554:HHJGLBGX5:4:12608:7922:6889, CIGAR covers 74724677 bases but the sequence is 56 read bases
ERROR::INVALID_CIGAR:Record 3772314, Read name NS500269:554:HHJGLBGX5:4:12608:7922:6889, No real operator (M|I|D|N) in CIGAR
ERROR::MISMATCH_CIGAR_SEQ_LENGTH:Record 3772315, Read name NS500269:554:HHJGLBGX5:2:12111:17591:2975, CIGAR covers 74784772 bases but the sequence is 56 read bases
ERROR::INVALID_CIGAR:Record 7614095, Read name NS500269:554:HHJGLBGX5:3:23504:4070:10746, No real operator (M|I|D|N) in CIGAR
ERROR::INVALID_CIGAR:Record 7614096, Read name NS500269:554:HHJGLBGX5:1:13209:20120:12740, No real operator (M|I|D|N) in CIGAR
ERROR::MISMATCH_CIGAR_SEQ_LENGTH:Record 16210944, Read name NS500269:554:HHJGLBGX5:3:11612:16545:9894, CIGAR covers 74740788 bases but the sequence is 56 read bases
ERROR::MISMATCH_CIGAR_SEQ_LENGTH:Record 16212577, Read name NS500269:554:HHJGLBGX5:3:12511:1956:13694, CIGAR covers 53715797 bases but the sequence is 56 read bases
ERROR::CIGAR_MAPS_OFF_REFERENCE:Record 16212577, Read name NS500269:554:HHJGLBGX5:3:12511:1956:13694, Read CIGAR M operator maps off end of reference

There are no errors detected when running on the original BAM file.

Command used to run:

STAR --readFilesIn intermediate/BAM/HLR.tagfix.bam --readFilesType SAM SE --genomeDir intermediate/index --outFileNamePrefix intermediate/BAM/HLR. --runThreadN 12 --outSAMtype BAM SortedByCoordinate --twopassMode Basic --readFilesCommand samtools view

(tagfix signifies that pysamtools was used to remove unwanted tags, see #939; this file was validated and no errors were found in it)

Log file:
https://termbin.com/mlk5

@alexdobin
Owner

Hi Jiří

Not sure what's going on here; the Log file does not have anything suspicious.
Could you please grep HLR.tagfix.bam for one of the reads that gatk ValidateSamFile flagged, and post it?
If it looks all right, please try to run your STAR command on just this read, and check what comes out.
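As an illustration of the lookup suggested above, here is a minimal Python sketch that pulls the records for one read name out of SAM text, for example the output of samtools view. The function name `records_named` is hypothetical and not part of STAR or samtools.

```python
from typing import Iterable, Iterator


def records_named(sam_lines: Iterable[str], qname: str) -> Iterator[str]:
    """Yield the SAM records whose first column (QNAME) equals qname."""
    for line in sam_lines:
        # Header lines start with '@' and are not alignment records.
        if line.startswith("@"):
            continue
        # Compare the whole QNAME field, so a name that merely shares a
        # prefix with the query does not match.
        if line.split("\t", 1)[0] == qname:
            yield line
```

Fed with the lines from samtools view on HLR.tagfix.bam, the matching records could be written to a small SAM file and passed to the same STAR command in place of the full input.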

Cheers
Alex

@J-Moravec
Author

J-Moravec commented Jun 29, 2020

Sample read:

From the HLR.tagfix.bam:

NS500269:554:HHJGLBGX5:1:21101:2511:3916 16 chr1 1324592 255 43M13S * 0 0 ACTCTGACCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCCCCATGTACTCTG /EEE/EA/E/AE/AEE/AE/EE/A<EEEE//AEEEEEEEEEEE/EEE/</EA/A<A NH:i:1 HI:i:1 AS:i:40 nM:i:1 TX:Z:ENST00000411962.5,+39,13S43M;ENST00000419704.5,+19,13S43M;ENST00000421495.6,+3,13S43M;ENST00000429572.5,+32,13S43M;ENST00000435064.5,+57,13S43M;ENST00000450926.6,+36,13S43M;ENST00000458452.7,+16,13S43M;ENST00000490853.5,+15,13S43M;ENST00000493534.6,+24,13S43M;ENST00000498173.2,+53,13S43M;ENST00000498476.6,+26,13S43M;ENST00000526113.1,+3,13S43M;ENST00000526332.5,+26,13S43M;ENST00000526797.5,+24,13S43M;ENST00000526904.5,+24,13S43M;ENST00000527098.5,+32,13S43M;ENST00000527719.5,+37,13S43M;ENST00000528879.5,+14,13S43M;ENST00000530031.5,+25,13S43M;ENST00000530233.1,+24,13S43M;ENST00000531019.5,+26,13S43M;ENST00000532952.5,+8,13S43M;ENST00000534345.5,+37,13S43M;ENST00000540437.5,+53,13S43M;ENST00000545578.5,+53,13S43M;ENST00000618806.4,+53,13S43M;ENST00000620829.4,+53,13S43M GX:Z:ENSG00000127054.20 GN:Z:INTS11 RE:A:E BC:Z:CTAAGTTT QT:Z:AAAAA//E CR:Z:CCCAGTTCAGTTCCCT CY:Z:AAAAAEEEAEEEEEEE CB:Z:CCCAGTTCAGTTCCCT-1 UR:Z:ATCTCTCACG UY:Z:EEAEEEEAE< UB:Z:ATCTCTCACG RG:Z:Diermeier_01-HLR_GRCh38:MissingLibrary:1:HHJGLBGX5:1

The first thing that jumps out at me is what appears to be duplication in the TX tag, but from what I can find online, that might be fine, since it means the read is compatible with several transcripts? And as far as I am aware, STAR should ignore these anyway?
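One way to check whether the TX entries are real duplicates is to split the tag and compare transcript IDs. The sketch below assumes the layout visible in the sample read: semicolon-separated entries, each of the form transcript,offset,CIGAR. That layout is inferred from this record and has not been checked against the tag's specification; the function names are hypothetical.

```python
from typing import List, Tuple


def parse_tx(tag_field: str) -> List[Tuple[str, str, str]]:
    """Split a 'TX:Z:...' field into (transcript, offset, cigar) tuples.

    Assumes entries look like 'ENST00000411962.5,+39,13S43M' and are
    separated by ';' (layout inferred from the sample read only).
    """
    prefix = "TX:Z:"
    if not tag_field.startswith(prefix):
        raise ValueError("not a TX:Z: field")
    entries = []
    for item in tag_field[len(prefix):].split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        transcript, offset, cigar = item.split(",")
        entries.append((transcript, offset, cigar))
    return entries


def has_duplicate_transcripts(tag_field: str) -> bool:
    """True if the same transcript ID appears in more than one entry."""
    ids = [transcript for transcript, _, _ in parse_tx(tag_field)]
    return len(ids) != len(set(ids))
```

Calling `len(tag_field)` on the same string gives the size of the tag, which is useful when a long tag is suspected of causing trouble downstream.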

When I try to grep it in the HLR.Aligned.sortedByCoord.out.bam, I get:

[E::bam_read1] CIGAR and query sequence lengths differ for NS500269:554:HHJGLBGX5:1:21101:2511:3916
[main_samview] truncated file.

but no read.
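For reference, the consistency check behind this kind of message can be sketched in a few lines of Python. Per the SAM specification, only the M, I, S, = and X operations consume read bases, so their lengths must add up to the length of SEQ. This is an illustrative re-implementation, not the code that samtools or gatk ValidateSamFile run, and the function names are hypothetical.

```python
import re

# CIGAR operations that consume bases of the read (SAM specification).
QUERY_CONSUMING = set("MIS=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_query_length(cigar: str) -> int:
    """Number of read bases that a CIGAR string accounts for."""
    if cigar == "*":
        return 0
    ops = CIGAR_OP.findall(cigar)
    # Rebuilding the string catches characters the pattern skipped over.
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return sum(int(length) for length, op in ops if op in QUERY_CONSUMING)


def cigar_matches_seq(sam_line: str) -> bool:
    """Check one SAM record: CIGAR is column 6, SEQ is column 10."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar, seq = fields[5], fields[9]
    if cigar == "*" or seq == "*":
        return True  # nothing to compare
    return cigar_query_length(cigar) == len(seq)
```

For the input record above, the CIGAR 43M13S accounts for 43 + 13 = 56 read bases, which agrees with the 56-base sequence, so the record is consistent before remapping.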

@alexdobin
Owner

Hi Jiří,

This is a bug: the TX:Z: string is too long, which causes an improper BAM record. I will fix it in a future release; for now you will need to remove the TX:Z: tag from the SAM file before feeding it to STAR.
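A minimal Python sketch of this workaround, operating on SAM text lines, might look as follows. The function name `strip_tag` is hypothetical; it relies only on the SAM layout of eleven mandatory tab-separated columns followed by optional TAG:TYPE:VALUE fields.

```python
def strip_tag(sam_line: str, tag: str = "TX") -> str:
    """Return the SAM line without optional fields named `tag`.

    Header lines (starting with '@') are returned unchanged.
    """
    if sam_line.startswith("@"):
        return sam_line
    newline = "\n" if sam_line.endswith("\n") else ""
    fields = sam_line.rstrip("\n").split("\t")
    # Columns 1-11 are mandatory; everything after them is an optional tag.
    mandatory, optional = fields[:11], fields[11:]
    kept = [field for field in optional if not field.startswith(tag + ":")]
    return "\t".join(mandatory + kept) + newline
```

Applied to each line read from standard input, this could sit between samtools view -h on the input BAM and samtools view -b writing the cleaned BAM.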

Cheers
Alex

@J-Moravec
Author

J-Moravec commented Jul 15, 2020

I can confirm that removing this tag does fix the problem.

Should I close this issue or do you want to change the label and close it after a patch?

@alexdobin
Owner

Great, thanks for checking it! Let's keep it open for now until the patch is in.

@alexdobin
Owner

Hi Jiří

thanks for reporting this bug, I have fixed it in the 2.7.5b release.

Cheers
Alex
