Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ema align output SAM parsing error #39

Open
pontushojer opened this issue Oct 21, 2020 · 4 comments
Open

ema align output SAM parsing error #39

pontushojer opened this issue Oct 21, 2020 · 4 comments

Comments

@pontushojer
Copy link
Contributor

I have been running ema (version 0.6.2) on reads in the longranger basic FASTQ format (BX:Z in header). I pipe the output directly to samtools sort. My command looks something like this.

ema align -1 <(pigz -cd in.1.fastq.gz) -2 <(pigz -cd in.2.fastq.gz) -r genome.fa -t 20 -p 10x 2> mapping.log | samtools sort - -@ 20 -o mapping.bam 2> sorting.log

Mostly this have been working fine but every other run have failed because samtools sort gets a parsing error

Either this one

[E::sam_parse1] unrecognized CIGAR operator

Or this one

[E::sam_parse1] CIGAR and query sequence are of different length

Both are related to the CIGAR so I think there is some formatting error here that creeps in every now and then. I looked SAM entries that caused the error in the one of the runs and found some strange things. Below are four lines, of which the third (marked in bold) is causing the error.

ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG	83	chr1	146535682	3	66M	=	146535535	-213	GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC	JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:0.5	MI:i:-1470287527	XF:i:0	RG:Z:4	XA:Z:chr1,+148197632,66M,0;
ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG	163	chr1	146535535	3	151M	=	146535682	213	TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT	AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:0.5	MI:i:-1470287527	XF:i:0	RG:Z:4	XA:Z:chr1,-148197694,151M,0;
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG	89	chr2	158084867	48	4214864^@	*	0	0	GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC	JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ	NM:i:4	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:-1470287525	XF:i:0	RG:Z:4
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG	165	*	0	0	*	chr2	158084867	0	TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT	AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ	BX:Z:AAAAAAAAAAAAAAAA-1	RG:Z:4

As you can see this SAM entry is just plain strange. It aslo contains a ^@ character for some reason. Interestingly the SEQ and QUAL strings for this entry does not below to the read with the QNAME ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG instead it belongs to the first entry in my example named ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG. This is also the read-pair just before the one causing the error in my FASTQ.

I have tried to replicate the error on a smaller subset of my data but have so far been unsuccessful. For example if i take the FASTQ entries corresponding to the failed SAM entries shown above I don't get any error. So somehow this only happens when running the full dataset.

Do you have any idea what could be causing this?

@arshajii
Copy link
Owner

Hi @pontushojer,
Sorry for the delay in responding. This is quite strange, and looks like a bug that only comes up when processing a large set of reads.

Couple quick questions:

  • Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?
  • How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

@pontushojer
Copy link
Contributor Author

Hi @arshajii,

No worries!

  • Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?

I am running a pre-built version from conda, version 0.6.2 build h8b12597_1.

  • How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

The datasets have been about 400-500 M read-pairs, so far I have had issues on about three of my dataset.

I have so far been unable to generate a smaller dataset to replicate the issue. If I extract the read-pairs for barcodes surrounding the entry that causes the error in the full dataset, it completes without error. I will continue to try and generate a subset, as you say it would help narrowing this down.

I can check about sending a full dataset...

@pontushojer
Copy link
Contributor Author

@arshajii I have now managed to generate a smaller subset that can recreate the issue.

Running the following:

ema align -1 <(pigz -cd failing.fastq.gz) -R '@RG\tID:1\tSM:20\tPU:unit1\tPL:ILLUMINA' -r genome.fa -t 4 -p 10x 2> mapping.log | samtools sort - -@ 4 -o out.bam -O BAM -l 0 2> sorting.log

outputs this to the sorting.log:

[E::sam_parse1] CIGAR and query sequence are of different length
samtools sort: truncated file. Aborting

If I skip the pipe to samtools sort and look at the unsorted file from ema I find two faulty entries:

A00187:292:H7G2JDSXY:3:2674:9516:18662:TTTTTTTGTAAGGAACTGAA	73	chrX	147384013	60	8006656M	*	0	0	ATAAAATTAAAAAAAAAAAAAAAAAAAAAAAAATATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	F:FF,F:F,FFF,:FFFFF:FFFFFFF,FFFFFFF,FFFFFFFFFFFFFFF,FFFFFF,,FFFFFF	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:1620035	XF:i:0	RG:Z:1
...
A00187:292:H7G2JDSXY:3:1271:24126:15405:AAAAAAAAAAAAATAAAAAA	73	chr3	26999354	22	46139657M691D46137351c691D	*	0	0	CCCCCTCATTGTCCTTGTCTATTACATTTTTATTTTTATATTATAATAGCTTATGGTATGTAAT	FF:F::FF:F,:FF:F:FF,FFFFFFF:,FF,FFFF:FFFFFFFFFFFF,,FFFF:FF::FF,F	NM:i:4	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:1620051	XF:i:0	RG:Z:1

As you see the cigars are 8006656M and 46139657M691D46137351c691D respectively which are both wrong. Also they have the input barcodes of TTTTTTTGTAAGGAACTGAA and AAAAAAAAAAAAATAAAAAA (found in read name) but the tagged barcode is BX:Z:AAAAAAAAAAAAAAAA-1 for both.

Hope this helps to locate the issue!

Subset: failing.fastq.gz

@pontushojer
Copy link
Contributor Author

Hi @arshajii.

I was wondering if you have had the opportunity too look into this issue after I posted the subset?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants