Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STARsolo: Cell barcode in BAM file is from previous read. #593

Closed
ghuls opened this issue Mar 18, 2019 · 6 comments
Closed

STARsolo: Cell barcode in BAM file is from previous read. #593

ghuls opened this issue Mar 18, 2019 · 6 comments
Labels
issue: code Likely to be an issue with STAR code resolved problem or issue that has been resolved

Comments

@ghuls
Copy link

ghuls commented Mar 18, 2019

Read with barcodes:

$ zcat sample_R1_001.fastq.gz | rg -B4 -A3 'NS500200:339:HGVCJBGX2:1:22301:6695:10874|NS500200:339:HGVCJBGX
2:3:23511:12854:2881|NS500200:339:HGVCJBGX2:4:11503:7841:15656'                                                                                                                                                
@NS500200:339:HGVCJBGX2:4:13606:9213:3620 1:N:0:GACTCAAA                                                                                                                                                       
GTCATTTAGCGTCTATGTCGTTTCAGA                                                                                                                                                                                    
+                                                                                                                                                                                                              
AAAAAEEEEEEEEEEEEEEEEEEEEEA                                                                                                                                                                                    
@NS500200:339:HGVCJBGX2:1:22301:6695:10874 1:N:0:GACTCAAA                                                                                                                                                      
GCGACCACATGCATGTCCCAGTTCCAA                                                                                                                                                                                    
+
AAAAAEEEEEEEEEEEEEEEEEEEEE/
@NS500200:339:HGVCJBGX2:1:13101:1395:10278 1:N:0:CTGCGGCT
AATCCAGCAAAGTGCGATTTTGGGGTT
+
AAAAAEEAEEAE/EEAEEEEEEAEAA/
@NS500200:339:HGVCJBGX2:3:23511:12854:2881 1:N:0:GACTCAAA
CAGCTGGGTTGGTTTGCCCCGGAATAT
+
AAAAAEEEEEEEEEEAEEEEAEAEEE/
@NS500200:339:HGVCJBGX2:1:23105:19029:4614 1:N:0:GACTCAAA
TGGCGCACAGGCAGTAGGATAGCAGGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEE/
@NS500200:339:HGVCJBGX2:4:11503:7841:15656 1:N:0:TCTGTTGG
GCGCAACTCAGCTTAGTCCTGGGAGGG
+
AAAAAEEEEEEEEEEEEEEEEAEAEE<

Read with RNA.

$ zcat sample_R2_001.fastq.gz | rg -B4 -A3 'NS500200:339:HGVCJBGX2:1:22301:6695:10874|NS500200:339:HGVCJBGX2:3:23511:12854:2881|NS500200:339:HGVCJBGX2:4:11503:7841:15656'
@NS500200:339:HGVCJBGX2:4:13606:9213:3620 2:N:0:GACTCAAA
AACCAAACCACTTTCACCGCTACACGACCGGGGGTATACTACGGTCAATGCTCTG
+
A/AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EAEEEAEEEE6
@NS500200:339:HGVCJBGX2:1:22301:6695:10874 2:N:0:GACTCAAA
CAATCATATTTTCAATTTATATATTGTATTTCTTAATATTATGACCAAGAATTTT
+
AAAAAEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEAEEAEEEEEE/AEEEEEE
@NS500200:339:HGVCJBGX2:1:13101:1395:10278 2:N:0:CTGCGGCT
GGGTGGTTGAGGAGAATAAGTGATACAGTCTATGTGAGTCTCAGTGTAGTGCCTG
+
AA///</EA//AE//EEE/EEE<EAE/A//EA//</<//AEAAE/A//EEE/AEE
@NS500200:339:HGVCJBGX2:3:23511:12854:2881 2:N:0:GACTCAAA
GCCAAGGTGGCTGAACTCAAAGGTCACACATCCCGGGTCCTGAGTCTGACCATGA
+
AA6AAEEEEEEEEEEEEEEEEAEEEEEE6/AE/EAE/EEEE/EAE6E/<EE/E//
@NS500200:339:HGVCJBGX2:1:23105:19029:4614 2:N:0:GACTCAAA
AAGCAGTGGTATCAACGCAGAGTACATGGGAATTCCAAGCAGACGATGGAACAGA
+
AAAAAEEEEEEEEEEEEEAEEEEEEAAEEEEEAEEEEAEEEEEEEEEEEEEEEE/
@NS500200:339:HGVCJBGX2:4:11503:7841:15656 2:N:0:TCTGTTGG
CCAATTATAAAATGGGCAAAAGTGACGGGGCGTGGTGCGTGGTGGCTCATACCGG
+
AAAAAEA/EEEEEEEEEEAE/E/A/AAE/AEEEEEE/A//AE/EE////EEEE/A

BAM file mapped with STAR solo has the wrong barcode (CR). The barcode it seems to put there is the barcode of the previous read in the input fastq file.

samtools view sample_mapped_with_STAR_solo.Aligned.sortedByCoord.out.bam | rg 'NS500200:339:HGVCJBGX2:1:22301:6695:10874|NS500200:339:HGVCJBGX2:3:23511:12854:2881|NS500200:339:HGVCJBGX2:4:11503:7841:15656'
samtools: /apps/leuven/thinking/2014a/software/zlib/1.2.8-foss-2014a/lib/libz.so.1: no version information available (required by samtools)
NS500200:339:HGVCJBGX2:3:23511:12854:2881       0       chr1    43827962        255     22M638N33M      *       0       0       GCCAAGGTGGCTGAACTCAAAGGTCACACATCCCGGGTCCTGAGTCTGACCATGA AA6AAEEEEEEEEEEEEEEEEAEEEEEE6/AE/EAE/EEEE/EAE6E/<EE/E//        NH:i:1  HI:i:1  AS:i:55 nM:i:0  CR:Z:AATCCAGCAAAGTGCG   CY:Z:AAAAAEEAEEAE/EEA   UR:Z:ATTTTGGGGT UY:Z:EEEEEEAEAA
NS500200:339:HGVCJBGX2:1:22301:6695:10874       16      chr7    151163951       3       55M     *       0       0       AAAATTCTTGGTCATAATATTAAGAAATACAATATATAAATTGAAAATATGATTG EEEEEEA/EEEEEEAEEAEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEAAAAA        NH:i:2  HI:i:1  AS:i:54 nM:i:0  CR:Z:GTCATTTAGCGTCTAT   CY:Z:AAAAAEEEEEEEEEEE   UR:Z:GTCGTTTCAG UY:Z:EEEEEEEEEE
NS500200:339:HGVCJBGX2:4:11503:7841:15656       0       chr7    158601857       255     53M2S   *       0       0       CCAATTATAAAATGGGCAAAAGTGACGGGGCGTGGTGCGTGGTGGCTCATACCGG AAAAAEA/EEEEEEEEEEAE/E/A/AAE/AEEEEEE/A//AE/EE////EEEE/A        NH:i:1  HI:i:1  AS:i:50 nM:i:1  CR:Z:TGGCGCACAGGCAGTA   CY:Z:AAAAAEEEEEEEEEEE   UR:Z:GGATAGCAGG UY:Z:EEEEEEEEEE
NS500200:339:HGVCJBGX2:1:22301:6695:10874       256     chr10   46914905        3       55M     *       0       0       CAATCATATTTTCAATTTATATATTGTATTTCTTAATATTATGACCAAGAATTTT AAAAAEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEAEEAEEEEEE/AEEEEEE        NH:i:2  HI:i:2  AS:i:54 nM:i:0  CR:Z:GTCATTTAGCGTCTAT   CY:Z:AAAAAEEEEEEEEEEE   UR:Z:GTCGTTTCAG UY:Z:EEEEEEEEEE

BAM file mapped with CellRanger does not have that problem:

$ samtools view sample_cellranger/outs/possorted_genome_bam.bam | rg 'NS500200:339:HGVCJBGX2:1:22301:6695:10874|NS500200:339:HGVCJBGX2:3:23511:12854:2881|NS500200:339:HGVCJBGX2:4:11503:7841:15656'
samtools: /apps/leuven/thinking/2014a/software/zlib/1.2.8-foss-2014a/lib/libz.so.1: no version information available (required by samtools)
NS500200:339:HGVCJBGX2:3:23511:12854:2881       0       chr1    43827962        255     22M638N33M      *       0       0       GCCAAGGTGGCTGAACTCAAAGGTCACACATCCCGGGTCCTGAGTCTGACCATGA AA6AAEEEEEEEEEEEEEEEEAEEEEEE6/AE/EAE/EEEE/EAE6E/<EE/E//        NH:i:1  HI:i:1  AS:i:55 nM:i:0  RG:Z:MM087_siNTC_cellrangerTest:MissingLibrary:1:HGVCJBGX2:3    TX:Z:uc001cix.3,+1400,55M;uc001ciy.3,+1407,55M  GX:Z:CDC20      GN:Z:CDC20     RE:A:E  CR:Z:CAGCTGGGTTGGTTTG   CY:Z:AAAAAEEEEEEEEEEA   CB:Z:CAGCTGGGTTGGTTTG-1 UR:Z:CCCCGGAATA UY:Z:EEEEAEAEEE UB:Z:CCCCGGAATA
NS500200:339:HGVCJBGX2:1:22301:6695:10874       16      chr7    151163951       255     55M     *       0       0       AAAATTCTTGGTCATAATATTAAGAAATACAATATATAAATTGAAAATATGATTG EEEEEEA/EEEEEEAEEAEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEAAAAA        NH:i:2  HI:i:1  AS:i:54 nM:i:0  RG:Z:MM087_siNTC_cellrangerTest:MissingLibrary:1:HGVCJBGX2:3    TX:Z:uc003wkh.1,+1167,55M       GX:Z:RHEB       GN:Z:RHEB       RE:A:E  MM:i:1CR:Z:GCGACCACATGCATGT    CY:Z:AAAAAEEEEEEEEEEE   CB:Z:GCGACCACATGCATGT-1 UR:Z:CCCAGTTCCA UY:Z:EEEEEEEEEE UB:Z:CCCAGTTCCA
NS500200:339:HGVCJBGX2:4:11503:7841:15656       0       chr7    158601857       255     53M2S   *       0       0       CCAATTATAAAATGGGCAAAAGTGACGGGGCGTGGTGCGTGGTGGCTCATACCGG AAAAAEA/EEEEEEEEEEAE/E/A/AAE/AEEEEEE/A//AE/EE////EEEE/A        NH:i:1  HI:i:1  AS:i:50 nM:i:1  RG:Z:MM087_siNTC_cellrangerTest:MissingLibrary:1:HGVCJBGX2:3    RE:A:N  CR:Z:GCGCAACTCAGCTTAG   CY:Z:AAAAAEEEEEEEEEEE   CB:Z:GCGCAACTCAGCTTAG-1UR:Z:TCCTGGGAGG UY:Z:EEEEEAEAEE UB:Z:TCCTGGGAGG
NS500200:339:HGVCJBGX2:1:22301:6695:10874       256     chr10   46914905        0       55M     *       0       0       CAATCATATTTTCAATTTATATATTGTATTTCTTAATATTATGACCAAGAATTTT AAAAAEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEAEEAEEEEEE/AEEEEEE        NH:i:2  HI:i:2  AS:i:54 nM:i:0  RG:Z:MM087_siNTC_cellrangerTest:MissingLibrary:1:HGVCJBGX2:3    RE:A:N  CR:Z:GCGACCACATGCATGT   CY:Z:AAAAAEEEEEEEEEEE   CB:Z:GCGACCACATGCATGT-1UR:Z:CCCAGTTCCA UY:Z:EEEEEEEEEE UB:Z:CCCAGTTCCA

@alexdobin
Copy link
Owner

Hi Gert,

I fixed the problem with shifted CR/CY/UR/UQ output, please try it from GitHub master.
I also cleaned up the line endings.
If it looks fine on your side, will release 2.7.0f tomorrow, and also the branch with new features, and then pull in your request for longer barcodes.

Cheers
Alex

@ghuls
Copy link
Author

ghuls commented Mar 21, 2019

Thanks a lot for the fix.

Some minor "issue", when --outReadsUnmapped Fastx is used with STAR solo Unmapped.out.mate2 is always empty, so it might be better to not try to write that file when using STAR solo.

Another thing I noticed before. If your input barcode file contains duplicate barcodes, *Solo.out/barcodes.tsv will keep those duplicates instead of only keeping unique ones.
The code in ParametersSolo.cpp is responsible for this:

    //load the CB whitlist and create unordered_map
    ifstream & cbWlStream = ifstrOpen(soloCBwhitelist, ERROR_OUT, "SOLUTION: check the path and permissions of the CB whitelist file: " + soloCBwhitelist, *pP);
    string seq1;
    while (cbWlStream >> seq1) {
        if (seq1.size() != cbL) {
            ostringstream errOut;
            errOut << "EXITING because of FATAL ERROR in input CB whitelist file: "<< soloCBwhitelist <<" the total length of barcode sequence is "  << seq1.size() << " not equal to expected " <<bL <<"\n"  ;
            errOut << "SOLUTION: make sure that the barcode read is the second in --readFilesIn and check that is has the correct formatting\n";
            exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_INPUT_FILES, *pP);
        };
        uint32 cb1;
        //convert to 2-bit format
        if (convertNuclStrToInt32(seq1,cb1)) {
            //cbWL.insert(cb1);
            cbWL.push_back(cb1);
        } else {
            pP->inOut->logMain << "WARNING: CB whitelist sequence contains non-ACGT and is ignored: " << seq1 <<endl;
        };
    };
    //int comp1 = [] (const void *a, const void *b) {uint32 aa=*(uint32*) a; uint32 bb=*(uint32*) b; if (a
    qsort(cbWL.data(),cbWL.size(),sizeof(uint32),funCompareNumbers<uint32>);

According to https://stackoverflow.com/questions/12308243/trying-to-use-qsort-with-vector qsort is also slower than std:sort

Wouldn't it be better to use a std::set?

@alexdobin
Copy link
Owner

Hi Gert,

Thanks for the thorough tests and very helpful reporting!

  1. Will fix the Unmapped.out.mate2 issue, it needs to output the barcode read - this will go into 2.7.0f
  2. Indeed, the WhiteList is presumed to contain no duplicates presently - it might be a good idea to do it. I will add it to the new branch and then to 2.7.1 release.
  3. I completely missed it that container sort is faster than qsort, thanks for pointing it out. I am also doing the sorting to collapse UMI, so sort might speed-up things a little. And, of course, it's safer than qsort.

Cheers
Alex

@alexdobin
Copy link
Owner

Hi Gert,

I have a made a new release that's supposed the fix the problem CR/CY... tags and Unmapped problems.
https://github.com/alexdobin/STAR/releases/tag/2.7.0f

Thanks for reporting them!
Cheers
Alex

@ghuls
Copy link
Author

ghuls commented Mar 29, 2019

Hi Alex,

Thanks for fixing those issues.
Do you also plan to fix the non-uniqueness of soloCBwhitelist in the future?

@alexdobin
Copy link
Owner

Hi Gert,

my plan is to
(i) release new branch today (new features I was working on for a few weeks)
(ii) merge in your pull-in requests
(iii) add uniqueness to the whitelist
(iv) make 2.7.1a release, hopefully next week

Cheers
Alex

@alexdobin alexdobin added the resolved problem or issue that has been resolved label Aug 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
issue: code Likely to be an issue with STAR code resolved problem or issue that has been resolved
Projects
None yet
Development

No branches or pull requests

2 participants