Add support for cell barcodes which are longer than 16 nucleotides. #588

ghuls · 2019-03-15T12:09:35Z

Add support for cell barcodes which are longer than 16 nucleotides.
The old code overwrote the first nucleotides of the cell barcodes
with As for position 1 till (length cell barcode - 16) as cell
barcodes where converted with "convertNuclStrToInt32" and converted
back to "convertNuclInt32toString" and this would cause an overflow.

The number of barcodes is still limited to 2^32 after this patch.
UMIs are still limited to 16 nucleotides. Longer UMIs could be
supported in a similar way.

Add support for cell barcodes which are longer than 16 nucleotides. The old code overwrote the first nucleotides of the cell barcodes with As for position 1 till (length cell barcode - 16) as cell barcodes where converted with "convertNuclStrToInt32" and converted back to "convertNuclInt32toString" and this would cause an overflow. The number of barcodes is still limited to 2^32 after this patch. UMIs are still limited to 16 nucleotides. Longer UMIs could be supported in a similar way.

ghuls · 2019-04-05T14:33:38Z

Rebased to avoid the whitespace issues which were fixed recently.

alexdobin · 2019-04-05T21:16:44Z

Hi Gert,

I just released the branch I was working on with new features for STARsolo:
https://github.com/alexdobin/STAR/tree/solo_GeneFull

It has an option of working without the whitelist, which might be useful for custom barcodes.
Error correction is not done in this case, and all distinct CBs are accepted.

I will now work on pulling in your requests.

Cheers
Alex

ghuls · 2019-04-06T20:52:41Z

Hi Alex,

What is new in that branch? Is is only that a white list is not needed anymore, or are there other changes?

It would also be useful for us if #592 is merged (mostly for normal STAR with ATAC data).

We had a new run with a custom design with barcodes of 20bp + 8 bp UMI, but this time other samples (not single-cell) were added which required a longer read 2, so the number of cycles was higher than needed for BC + UMI for read 2 (more based than 20 + 8 + 28). So it would be nice in the future if there was a way to specify where the barcode an UMI are located. Maybe a syntax similar to Illumina bc2fastq might be an option, instead of needing all those long options:

        --soloCBstart 1 \
        --soloCBlen 20 \
        --soloUMIstart 21 \
        --soloUMIlen 8 \

R2:B20U8n4 ==> Read2: barcode of 20 bp, UMI of 8 bp, ignore 4 bp. If this scheme is flexible enough, the barcode could even be in read one. Also reversing of UMI and BC would be possible, or having a spacer between BC and UMI.

EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 32 not equal to expected 28
Read ID=@NB501171:362:H3NKYBGXB:1:11101:12429:1046 1 N 0   Sequence=CGCACACCAGAATATAAATAAAGCACCAGCTT
SOLUTION: make sure that the barcode read is the second in --readFilesIn and check that is has the correct formatting
          If UMI+CB length is not equal to the barcode read length, specify barcode read length with --soloBarcodeReadLength

Specifying --soloBarcodeReadLength 28 prints the same error.

alexdobin · 2019-04-08T21:57:08Z

Hi Gert,

I have pulled in this request into the new branch, and verified that it works for 16b CB in my test runs.
Would be great if you could test it with your data.

Another major feature in this branch is the ability to count all reads overlapping a gene - i.e. include the reads coming from pre-mRNA, in addition to the standard counting of reads that agree with the spliced transcripts. This is activated with
--soloFeatures Gene GeneFull

It's a good idea to have a string encoding the CB and UMI position, also useful for other technologies like inDrop or sci-RNA-seq or Split-seq.
In your case, you can specify
--soloBarcodeReadLength 0
and STAR will not check whether the read 2 length matches the UMI+CB.

I will pull in the #592 shortly.

Cheers
Alex

ghuls · 2019-04-09T14:28:03Z

@alexdobin Thanks for merging. I think your no whitelist patch is not completely compatible with this branch. It works find as far as I can tell when a whilelist is provided.

# With whitelist specified:
$ tail barcodes.tsv
TTTTTTCCATTTCTACCGGT
TTTTTTCCATTTCTCCCTTC
TTTTTTCCATTTGAATCTGG
TTTTTTCCATTTGGGCTCCC
TTTTTTCCATTTGTACCCAC
TTTTTTCCATTTTATGGAGC
TTTTTTCCATTTTCCTGCGT
TTTTTTCCATTTTCGTCTTC
TTTTTTCCATTTTGGGGGGC
TTTTTTCCATTTTTGCGCAC

# With whitelist None.
$ tail barcodes.tsv
AAAATTTTTTTTTGTTTGAC
AAAATTTTTTTTTTAGCACG
AAAATTTTTTTTTTCCCAAC
AAAATTTTTTTTTTCGCCTG
AAAATTTTTTTTTTCGCCTT
AAAATTTTTTTTTTGAGAAC
AAAATTTTTTTTTTGCGCAC
AAAATTTTTTTTTTTACTTC
AAAATTTTTTTTTTTCCGGT
AAAATTTTTTTTTTTGCTCC

I guess the problem comes from this code in SoloReadBarcode_getCBandUMI.cpp as it assigns cbB (uint64) to cbI (int32):

   if (pSolo.cbWL.size()==0) {//no whitelist - no search
       if (posN!=-1) {//Ns are present, discard this read
           stats.V[stats.nNinBarcode]++;
       } else {//no Ns
           cbI=cbB;//all possible barcodes are accepted
           cbMatch=0;            
       };
       return;
   };

alexdobin · 2019-04-11T23:03:08Z

Hi Gert,

please check the latest commit on this branch, it should solve the problem.

Thanks a lot!
Alex

ghuls · 2019-04-12T12:45:35Z

I get a segmentation fault:

DAT__387aae.STARsolo_hg19_no_whitelist_no_AAAA.pbs: line 13:  5312 Segmentation fault      (core dumped
) /staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR runMode alignReads --soloType Droplet -
-readFilesCommand zcat --readFilesIn "${fastq_R1_filename}" "${fastq_R23_filename}" --genomeDir "${geno
me_dir}" --soloCBwhitelist None --soloBarcodeReadLength 32 --soloCBstart 1 --soloCBlen 20 --soloUMIstar
t 21 --soloUMIlen 8 --alignIntronMax 1 --sjdbGTFfile /staging/leuven/stg_00002/lcb/ghuls/ngs_runs/NextS
eq500_20190228/MCF-7_ATAC_peaks.gff --sjdbGTFfeatureExon Region --sjdbGTFtagExonParentTranscript region
_id --sjdbGTFtagExonParentGene region_name --runThreadN 24 --outSAMtype BAM SortedByCoordinate --outRea
dsUnmapped Fastx --outSAMattributes NH HI AS nM CR CY UR UY --outFileNamePrefix "${output_prefix}"

In dmesg:

STAR[5312] general protection ip:4258e4 sp:7ffd9c1a7ab0 error:0 in STAR[400000+12c000]

I had also an segmentation fault when I tried to fix this bug myself before.

alexdobin · 2019-04-12T19:16:00Z

Hi Gert,

Thanks a lot for testing it!
I cannot reproduce this problem with your parameters on a mock 20b barcode dataset (I jsut added ACGT to the normal 10X 16b barcode).
Could you try it on a small subset of reads (e.g. using --readMapNumber 1000000) and send it to me, along with the Log.out file and the GFF file.

Cheers
Alex

ghuls mentioned this pull request Mar 15, 2019

STARsolo: "--soloCBlen 20" mangles cell barcodes #587

Closed

ghuls closed this Apr 5, 2019

ghuls force-pushed the support_CBs_greater_than_16 branch from 5cdf9bf to 35048f8 Compare April 5, 2019 14:28

ghuls reopened this Apr 5, 2019

alexdobin changed the base branch from master to solo_GeneFull April 5, 2019 22:22

alexdobin merged commit 82e470d into alexdobin:solo_GeneFull Apr 8, 2019

ghuls mentioned this pull request Apr 17, 2020

STARsolo: support for multiple (3) barcode locations #838

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for cell barcodes which are longer than 16 nucleotides. #588

Add support for cell barcodes which are longer than 16 nucleotides. #588

ghuls commented Mar 15, 2019

ghuls commented Apr 5, 2019

alexdobin commented Apr 5, 2019

ghuls commented Apr 6, 2019

alexdobin commented Apr 8, 2019

ghuls commented Apr 9, 2019

alexdobin commented Apr 11, 2019

ghuls commented Apr 12, 2019

alexdobin commented Apr 12, 2019

Add support for cell barcodes which are longer than 16 nucleotides. #588

Add support for cell barcodes which are longer than 16 nucleotides. #588

Conversation

ghuls commented Mar 15, 2019

ghuls commented Apr 5, 2019

alexdobin commented Apr 5, 2019

ghuls commented Apr 6, 2019

alexdobin commented Apr 8, 2019

ghuls commented Apr 9, 2019

alexdobin commented Apr 11, 2019

ghuls commented Apr 12, 2019

alexdobin commented Apr 12, 2019