ema align output SAM parsing error #39

pontushojer · 2020-10-21T16:06:20Z

I have been running ema (version 0.6.2) on reads in the longranger basic FASTQ format (BX:Z in header). I pipe the output directly to samtools sort. My command looks something like this.

ema align -1 <(pigz -cd in.1.fastq.gz) -2 <(pigz -cd in.2.fastq.gz) -r genome.fa -t 20 -p 10x 2> mapping.log | samtools sort - -@ 20 -o mapping.bam 2> sorting.log

Mostly this have been working fine but every other run have failed because samtools sort gets a parsing error

Either this one

[E::sam_parse1] unrecognized CIGAR operator

Or this one

[E::sam_parse1] CIGAR and query sequence are of different length

Both are related to the CIGAR so I think there is some formatting error here that creeps in every now and then. I looked SAM entries that caused the error in the one of the runs and found some strange things. Below are four lines, of which the third (marked in bold) is causing the error.

ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG	83	chr1	146535682	3	66M	=	146535535	-213	GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC	JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:0.5	MI:i:-1470287527	XF:i:0	RG:Z:4	XA:Z:chr1,+148197632,66M,0;
ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG	163	chr1	146535535	3	151M	=	146535682	213	TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT	AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:0.5	MI:i:-1470287527	XF:i:0	RG:Z:4	XA:Z:chr1,-148197694,151M,0;
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG	89	chr2	158084867	48	4214864^@	*	0	0	GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC	JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ	NM:i:4	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:-1470287525	XF:i:0	RG:Z:4
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG	165	*	0	0	*	chr2	158084867	0	TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT	AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ	BX:Z:AAAAAAAAAAAAAAAA-1	RG:Z:4

As you can see this SAM entry is just plain strange. It aslo contains a ^@ character for some reason. Interestingly the SEQ and QUAL strings for this entry does not below to the read with the QNAME ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG instead it belongs to the first entry in my example named ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG. This is also the read-pair just before the one causing the error in my FASTQ.

I have tried to replicate the error on a smaller subset of my data but have so far been unsuccessful. For example if i take the FASTQ entries corresponding to the failed SAM entries shown above I don't get any error. So somehow this only happens when running the full dataset.

Do you have any idea what could be causing this?

The text was updated successfully, but these errors were encountered:

arshajii · 2020-10-28T22:40:35Z

Hi @pontushojer,
Sorry for the delay in responding. This is quite strange, and looks like a bug that only comes up when processing a large set of reads.

Couple quick questions:

Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?
How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

pontushojer · 2020-10-29T09:46:49Z

Hi @arshajii,

No worries!

Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?

I am running a pre-built version from conda, version 0.6.2 build h8b12597_1.

How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

The datasets have been about 400-500 M read-pairs, so far I have had issues on about three of my dataset.

I have so far been unable to generate a smaller dataset to replicate the issue. If I extract the read-pairs for barcodes surrounding the entry that causes the error in the full dataset, it completes without error. I will continue to try and generate a subset, as you say it would help narrowing this down.

I can check about sending a full dataset...

pontushojer · 2020-11-19T10:58:49Z

@arshajii I have now managed to generate a smaller subset that can recreate the issue.

Running the following:

ema align -1 <(pigz -cd failing.fastq.gz) -R '@RG\tID:1\tSM:20\tPU:unit1\tPL:ILLUMINA' -r genome.fa -t 4 -p 10x 2> mapping.log | samtools sort - -@ 4 -o out.bam -O BAM -l 0 2> sorting.log

outputs this to the sorting.log:

[E::sam_parse1] CIGAR and query sequence are of different length
samtools sort: truncated file. Aborting

If I skip the pipe to samtools sort and look at the unsorted file from ema I find two faulty entries:

A00187:292:H7G2JDSXY:3:2674:9516:18662:TTTTTTTGTAAGGAACTGAA	73	chrX	147384013	60	8006656M	*	0	0	ATAAAATTAAAAAAAAAAAAAAAAAAAAAAAAATATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	F:FF,F:F,FFF,:FFFFF:FFFFFFF,FFFFFFF,FFFFFFFFFFFFFFF,FFFFFF,,FFFFFF	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:1620035	XF:i:0	RG:Z:1
...
A00187:292:H7G2JDSXY:3:1271:24126:15405:AAAAAAAAAAAAATAAAAAA	73	chr3	26999354	22	46139657M691D46137351c691D	*	0	0	CCCCCTCATTGTCCTTGTCTATTACATTTTTATTTTTATATTATAATAGCTTATGGTATGTAAT	FF:F::FF:F,:FF:F:FF,FFFFFFF:,FF,FFFF:FFFFFFFFFFFF,,FFFF:FF::FF,F	NM:i:4	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:1620051	XF:i:0	RG:Z:1

As you see the cigars are 8006656M and 46139657M691D46137351c691D respectively which are both wrong. Also they have the input barcodes of TTTTTTTGTAAGGAACTGAA and AAAAAAAAAAAAATAAAAAA (found in read name) but the tagged barcode is BX:Z:AAAAAAAAAAAAAAAA-1 for both.

Hope this helps to locate the issue!

Subset: failing.fastq.gz

pontushojer · 2021-01-20T16:36:55Z

Hi @arshajii.

I was wondering if you have had the opportunity too look into this issue after I posted the subset?

pontushojer mentioned this issue Oct 28, 2020

Rule map_sort sometimes fails for ema FrickTobias/BLR#235

Closed

pontushojer mentioned this issue Oct 2, 2023

Rule map_sort sometimes fails for ema AfshinLab/BLR#73

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ema align output SAM parsing error #39

ema align output SAM parsing error #39

pontushojer commented Oct 21, 2020

arshajii commented Oct 28, 2020

pontushojer commented Oct 29, 2020

pontushojer commented Nov 19, 2020

pontushojer commented Jan 20, 2021

ema align output SAM parsing error #39

ema align output SAM parsing error #39

Comments

pontushojer commented Oct 21, 2020

arshajii commented Oct 28, 2020

pontushojer commented Oct 29, 2020

pontushojer commented Nov 19, 2020

pontushojer commented Jan 20, 2021