Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect Q30 base reporting in CB + UMI when --soloBarcodeReadLength is greater than 26 #798

Open
vishakagopalan opened this issue Dec 23, 2019 · 1 comment
Labels
bug resolved problem or issue that has been resolved

Comments

@vishakagopalan
Copy link

I'm using STARsolo on a 10X v2 scRNA-seq dataset where the barcode read contained the CB + UMI sequence is not 26 base pairs but 150 base pairs in length. In order to run STAR solo, I set the --soloBarcodeReadLength to 150. The following was my invocation command and the output from STAR solo :

STAR --runThreadN $SLURM_CPUS_PER_TASK --genomeDir /fdb/STAR_current/UCSC/hg38/genes-50/ --sjdbOverhang 50 --readFilesIn cdna.fastq.gz barcodes.full.fastq.gz --soloCBwhitelist $DATA/737K-august-2016.txt --soloType CB_UMI_Simple --readFilesCommand zcat --outFileNamePrefix 150bp_barcode_run --soloBarcodeReadLength 150

STAR solo runs perfectly without any error messages in any of the log files. Below is the output of the Summary.csv file in the 150bp_barcode_run/Gene directory :

Number of Reads,250000
Reads With Valid Barcodes,0.969452
Sequencing Saturation,0.0840912
Q30 Bases in CB+UMI,0.598961
Q30 Bases in RNA read,0.880727
Reads Mapped to Genome: Unique+Multiple,0.842508
Reads Mapped to Genome: Unique,0.605768
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.469796
Reads Mapped to Transcriptome: Unique Genes,0.429724
Estimated Number of Cells,4045
Reads in Cells Mapped to Unique Genes,79863
Fraction of Reads in Cells,0.743389
Mean Reads per Cell,19
Median Reads per Cell,15
UMIs in Cells,73178
Mean UMI per Cell,18
Median UMI per Cell,13
Mean Genes per Cell,16
Median Genes per Cell,13
Total Genes Detected,8958

The fraction of Q30 bases in the CB + UMI field is 0.598961. This came as a surprise to me since the following was the fastqc report on the barcodes.full.fastq.gz file :

Per base sequence quality warn
#Base Mean Median Lower Quartile Upper Quartile 10th Percentile 90th Percentile
1 30.86634 32.0 32.0 32.0 32.0 32.0
2 31.60766 32.0 32.0 32.0 32.0 32.0
3 35.4394 37.0 37.0 37.0 32.0 37.0
4 36.03234 37.0 37.0 37.0 32.0 37.0
5 36.15878 37.0 37.0 37.0 37.0 37.0
6 39.73438 41.0 41.0 41.0 37.0 41.0
7 39.729704 41.0 41.0 41.0 37.0 41.0
8 39.937752 41.0 41.0 41.0 37.0 41.0
9 39.856028 41.0 41.0 41.0 37.0 41.0
10-14 39.84311280000001 41.0 41.0 41.0 37.0 41.0
15-19 39.445888000000004 41.0 41.0 41.0 36.0 41.0
20-24 39.3275304 41.0 41.0 41.0 36.0 41.0
25-29 39.6787784 41.0 41.0 41.0 37.0 41.0
30-34 40.0195728 41.0 41.0 41.0 39.4 41.0
35-39 40.152637600000006 41.0 41.0 41.0 41.0 41.0
40-44 40.1812248 41.0 41.0 41.0 41.0 41.0
45-49 40.0375656 41.0 41.0 41.0 41.0 41.0
50-54 38.5896928 41.0 40.2 41.0 33.0 41.0
55-59 31.0855344 36.6 24.0 39.4 12.0 41.0
60-64 27.828351200000004 30.0 16.0 37.0 12.0 41.0
65-69 27.706464000000004 29.0 14.0 38.6 12.0 41.0
70-74 26.9344712 27.0 12.0 37.0 12.0 41.0
75-79 25.335884 26.0 12.0 35.0 12.0 39.4
80-84 26.683612800000002 27.0 12.0 37.0 12.0 41.0
85-89 26.334922400000004 26.0 12.0 36.0 12.0 41.0
90-94 24.975980800000002 27.0 12.0 37.0 12.0 41.0
95-99 25.351571999999997 27.0 12.0 37.0 12.0 41.0
100-104 25.4798168 27.0 12.0 37.0 12.0 41.0
105-109 23.8054096 23.0 12.0 35.0 12.0 41.0
110-114 23.5503648 22.0 12.0 36.0 12.0 41.0
115-119 24.0389576 22.0 12.0 37.0 12.0 41.0
120-124 22.266681600000002 22.0 12.0 32.0 8.0 39.4
125-129 22.3947272 22.0 12.0 32.0 9.6 38.6
130-134 22.271975200000004 22.0 12.0 32.0 8.8 37.8
135-139 22.557953599999998 22.0 12.0 32.0 10.4 39.4
140-144 22.385848000000003 22.0 12.0 32.0 9.6 37.8
145-149 21.060316000000004 20.0 12.0 29.0 8.8 37.0
150 21.028484 22.0 12.0 27.0 8.0 37.0

As you can see, the first 26 bases have a lower quartile Q score of at least 30. But considering that the mean Q scores for the remaining bases are less than 30, I suspected that STAR solo was computing the average base quality scores across all 150 bases of the barcode read instead of restricting the Q30 score calculation to the first 26 bases of the barcode read.

Since it was still possible that the CB + UMI barcode is perhaps starting from somewhere in the middle of the read, as opposed to the first base onwards, I also checked to see if the barcode reads have an issue with where the CB + UMI sequence is located. There seemed to be no issue with that as you can see from a zcat of the barcode file (apologies for the formatting of the fastq reads) ---
zcat barcodes.full.fastq.gz | head -n 100000 | tail -n 20

@ST-E00126:696:HMYGHCCXY:5:1101:7618:3964 1:N:0:AGGATACA
TTAGGCAGTGACGCCTATTCAGCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCCATTTGTCGCTGTTACGATTGTCATAGTGAGCTGAG

AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJAFAAJJFFAFJ7F-7-<-F--7<7<<F<JJJJ-7AFFF-7F-<---77---7----7--7<--7---))--7--7---7-------7----)7)
@ST-E00126:696:HMYGHCCXY:5:1101:7821:3964 1:N:0:AGGATACA
GCACATGATATTATGAACAATACAAATGCATTATTTTTATCATTAATAGTTTAATCATTAATTATCTCATAAGTCAATGCAGAGAGTGAAATTACTATGAATTAATCTTCTGTTCACAATGTACAGTATTTTGCATATGTTGACTTTACT

AAAAFFFFFJFJFFFFJJJJJJJJJJJA<-AFAAFFJJJJ<FJJJJJJJAAJFJJJF-7AF-7AF<<FJJJJJF<-<FFAJFAJ<F7F-AFAFJ<7A7F<A7FFF<F<--AAF<FAFJA<7<-<<<<JAA-7FFFJAFJJJF<JAF<FA7
@ST-E00126:696:HMYGHCCXY:5:1101:8308:3964 1:N:0:AGGATACA
CGCTGGACAACGATCTGGGGAGTGGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCAGAGATGTACTTCTTATTTATTTTGGACCCCCAACAAGGAAGAAAGGTCTCACCCTCAATTCTTCTCTTTCAATACTTTAGTATTGTCACCCCT

AAFFFFJJJJFJJFJJJFJJJJ7JAJAFJJFJJJJAJJJJJJJJJJJJJFFJJJA--77)--7<---7<<-7--77<-A-----<A---7<-<<--A<-<---7A-7--)-)7)7--FJFFAF-7-AJ---777----77---7--)7))
@ST-E00126:696:HMYGHCCXY:5:1101:8471:3964 1:N:0:AGGATACA
AGTGTCAAGATGTTAGTAAATCTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAACATCCCCAAATAATTTATTTGGACTCAGAATTAAAAGAACATTTGACAGTTTAGAAATGCATGTTTATTCTGAAACCTCTAACCAGTTGGAC

AAFFFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAFF--A7--7)7---7FJA--7----7-)-7---AFJJ7-7<-AF<F-----7-----7A--<7--7-7-AFAA--7<F----<7-----7--7<
@ST-E00126:696:HMYGHCCXY:5:1101:8572:3964 1:N:0:AGGATACA
TTTGTCAGTACGAAATGCCCAGACTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTCTTCAACTTTTTTAAAACATTTAGATTTTTTTGCACACACCAGTGTTTTTTGTTTTAGCTCATCAGTTATGTGTACTTGATTTTATG

The barcodes are all located before the start of the poly-T tract from the first base onwards. Finally, I decided to truncate the fastq reads to the first 26 positions and re-run STAR solo ---

STAR --runThreadN $SLURM_CPUS_PER_TASK --genomeDir /fdb/STAR_current/UCSC/hg38/genes-50/ --sjdbOverhang 50 --readFilesIn cdna.fastq.gz barcodes.fastq.gz --soloCBwhitelist $DATA/737K-august-2016.txt --soloType CB_UMI_Simple --readFilesCommand zcat --outFileNamePrefix 26bp_barcode_run

This time around, the Q30 values for the CB + UMI read made sense ---
cat 26bp_barcode_runSolo.out/Gene/Summary.csv

Number of Reads,250000
Reads With Valid Barcodes,0.969452
Sequencing Saturation,0.0840912
Q30 Bases in CB+UMI,0.96489
Q30 Bases in RNA read,0.880727
Reads Mapped to Genome: Unique+Multiple,0.842508
Reads Mapped to Genome: Unique,0.605768
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.469796
Reads Mapped to Transcriptome: Unique Genes,0.429724
Estimated Number of Cells,4045
Reads in Cells Mapped to Unique Genes,79863
Fraction of Reads in Cells,0.743389
Mean Reads per Cell,19
Median Reads per Cell,15
UMIs in Cells,73178
Mean UMI per Cell,18
Median UMI per Cell,13
Mean Genes per Cell,16
Median Genes per Cell,13
Total Genes Detected,8958

To me, this confirms that STAR solo is accidentally taking into account bases 26-150 in the barcode read when reporting the Q30 Bases in CB+UMI value in the CSV file.

@alexdobin alexdobin added the bug label Dec 30, 2019
alexdobin added a commit that referenced this issue Jan 31, 2020
… calculated only for CB/UMI portions of the read (#798).
@alexdobin
Copy link
Owner

Hi @vishakadsg

thanks a lot for the very detailed report of the problem!
I fixed the bug - please try the develop branch:
https://github.com/alexdobin/STAR/tree/develop

Cheers
Alex

@alexdobin alexdobin added the resolved problem or issue that has been resolved label Jan 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug resolved problem or issue that has been resolved
Projects
None yet
Development

No branches or pull requests

2 participants