Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STARsolo: Only one read in the BAM file contain filtered barcode #954

Closed
J-Moravec opened this issue Jun 25, 2020 · 10 comments
Closed

STARsolo: Only one read in the BAM file contain filtered barcode #954

J-Moravec opened this issue Jun 25, 2020 · 10 comments

Comments

@J-Moravec
Copy link

I am remapping a BAM file created by 10x cell_ranger (which use STAR internally).

I have converted BAM into fastq files with 10X bamtofastq. From these files, R1 seems to contain a barcode, R2 seems to contains a read and I1 contains god knows what.

I have fed these into STARsolo (in the order R2 R1 as described in manual) and got Aligned.sortedByCoords.out.BAM as well as filtered barcodes in Solo.out/Gene/filtered/barcodes.tsv which should, if I understand everything correctly, contain barcodes (CB) for well-represented cells with a lot of reads.

I then used pysam to check this, but by running:

import pysam

file = "Aligned.sortedByCoords.out.bam"
barcode = <barcode>

with pysam.AlignmentFile(file, "rb") as bam
  reads_with_barcode = [read for read in bam.fetch(until_eof=True) if read.has_tag("CB") and read.get_tag("CB") == barcode]
  len(reads_with_barcode)

which should search the bam file and found every single read contained particular barcode. Running this with the first barcode in filtered/barcodes.tsv I got... 1. A single read contained this barcode. I am completely baffled by this.

Also, the number of barcodes changes from 160 000 (original BAM) to some 260 000 (new BAM).
I am also confused by Barcodes.stats, I got all zeros: https://termbin.com/6gpts

Both of these would suggest that maybe I didn't do something right?

Log files:
Log.out: https://termbin.com/xlgt
Log.final.out: https://termbin.com/l1e9

Thank you.

@alexdobin
Copy link
Owner

Let's first figure out why there are only zeros in Barcodes.stats. The Log.out at the end shows that barcodes were detected. What is the content of Solo.out/Gene/Features.stats and Solo.out/Gene/Summary.csv?

@J-Moravec
Copy link
Author

Both seems to be well populated:

Features: https://termbin.com/76zg

Summary: https://termbin.com/42hb

In the Summary, the Reads With Valid Barcode is suspicious, but I presume this is 100% and not literally a single read with a valid barcode.

@alexdobin
Copy link
Owner

Right, this is 100% - it means that all the barcodes match the whitelist, i.e. all were error corrected, which should not happen. This explains why most lines in Barcode.stats are zero, but nExactMatch should not be zero - should contain total number of reads.

Can you try to cut ~100k lines from the original BAM, and go with the whole procedure again - convert bam2fastq and then run STARsolo. If it still has this problem, can you share this small BAM, so that I can reproduce the problem.

Thanks!
Alex

@J-Moravec
Copy link
Author

J-Moravec commented Jun 29, 2020

Done the procedure again, only 100k reads:

samtools view data/HLR100k.bam | wc -l
# 100000

Log: https://termbin.com/qdo9

Log.final.out: https://termbin.com/riue

                          Number of input reads |	47080

Somehow I have lost 47080 reads?

I have checked the fastq files produced by 10Xs bamtofastq and there truly only 47080 reads! So Unless I don't get something in bamfiles or fastq files or bamtofastq just ate half of my reads! I almost just went around and used cellranger count with the bamtofastq, but that would be a grave error without knowing this!

Never mind. I checked the aligned file out of STAR and it had above 100k reads. So the "duplicated" reads are reads that were mapped on several places. I tried to grep a read from the original file in the new file and couldn't get a match, so possibly some read names (the last two numbers) can change? Or the bamtofastq changes names?

Also apparently some reads do not have CB tag?

samtools view STARsolo/BAM/HLR100k.Aligned.sortedByCoord.out.bam | wc -l
#102281

but:

samtools view STARsolo/BAM/HLR100k.Aligned.sortedByCoord.out.bam | grep "CB" | wc -l
# 30190

So continuing exploration:

Barcodes.stats are exactly the same, all zeros.

Features: https://termbin.com/8wu4
Summary: https://termbin.com/tc9k

Right, this is 100% - it means that all the barcodes match the whitelist, i.e. all were error corrected, which should not happen.

I don't have whitelist. I presumed that since I am working on already preprocessed data, I don't need it?

Regarding bamtofastq, here is info from header:

@CO     10x_bam_to_fastq:I1(BC:QT)
@CO     10x_bam_to_fastq:R1(CR:CY,UR:UY)
@CO     10x_bam_to_fastq:R2(SEQ:QUAL)

And first few lines in fastq from bamtofastq

cat bamtofastq_S1_L001_I1_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 2:N:0:0
CTAAGTTT
+
AAAAAE6E
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 2:N:0:0
TACCACCA
+
AAAAAEEE
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 2:N:0:0
TACCACCA
cat bamtofastq_S1_L001_R1_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 1:N:0:0
GTTACAGAGTGACTCTGATCAGCGCC
+
AAAAAEE6EEE6EEE/A/EE/EE/E<
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 1:N:0:0
GATGCTACAACTGGCCTCCCTACCGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEE
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 1:N:0:0
CATGGCGGTAATCACCAGTTTTTCAG
cat bamtofastq_S1_L001_R2_001.fastq.gz | gunzip | head
@NS500269:554:HHJGLBGX5:1:12305:24406:14142 3:N:0:0
ATAAAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTCTG
+
AA6AA6EEEE6//EEE///EAEEEEEEAE6E//E/EAE/E///6EEEEEEEAE/EA
@NS500269:554:HHJGLBGX5:1:13102:9918:2719 3:N:0:0
GGCTTCCAGAGAAAATGGCACACCAATCAATAAAGAACTGAGCAGAAAAAAAAAAA
+
A6AAAEEEEAEEEEEEEEEAEEEAAEEEAEE66EE/EEEAAAE6E/A6AAA6A6E/
@NS500269:554:HHJGLBGX5:1:13304:15435:15006 3:N:0:0
GCGCAACTGTTTACTTAAGAGGCTTCCAGAGAAAAATGCACACAAATCAATAAAGA

Anything suspicious looking?

Btw. thanks again Alex for helping me here.

@alexdobin
Copy link
Owner

Aah, I missed that you are were running it without whitelist. I think this explains all the "weird" observations.
You need to specify whitelist for proper operation (i.e. very similar to CellRanger), otherwise, the error correction of barcodes is not performed. This looks like 10X v2 data, so you need the 737K-august-2016.txt whitelist from 10X. I have it here http://labshare.cshl.edu/shares/dobin/dobin/STARsolo/10X/whitelists/737K-august-2016.txt.gz

@J-Moravec
Copy link
Author

I can confirm that this fixed the problem. Strange that the None value is allowed but stuff breaks with it.

@alexdobin
Copy link
Owner

Hi Jiří

does it break or does it run to completion?
I think it just produces different results - as expected since without whitelist there is no error-correction of barcodes.
This option is used for "classic" non-10X Drop-seq, which usually do not have barcode lists.

Cheers
Alex

@J-Moravec
Copy link
Author

It runs into completion. What I meant by break is that (as title says) only a single read contained the filtered barcode, which was quite surprising. Even without barcode correction, I would expect more. Or less filtered barcodes. But the number of filtered barcodes remain relatively (if not completely) constant and similar (if not the same) as the number of filtered barcodes obtained from 10x cellranger count.

@alexdobin
Copy link
Owner

You are right - for filtered barcodes, even without the whitelist, you should see many reads in the BAM file.
I do not see this issue in my test. Could you try to just grep this barcode from the BAM file (mapped with --soloCBwhitelist None), i.e. something like
samtools view Aligned.out.bam | grep AAACCTGAGTGAAGAG
To speed it up, you can map only first 100000 reads with --readMapNumber 100000

Thanks!
Alex

@alexdobin alexdobin reopened this Jul 16, 2020
@J-Moravec
Copy link
Author

Can't replicate the problem any more after the problem with TX:Z tag from #952 was fixed.

STARsolo reported correct tax with even with --soloCBwhitelist None.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants