Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for cell barcodes which are longer than 16 nucleotides. #588

Merged

Conversation

ghuls
Copy link

@ghuls ghuls commented Mar 15, 2019

Add support for cell barcodes which are longer than 16 nucleotides.
The old code overwrote the first nucleotides of the cell barcodes
with As for position 1 till (length cell barcode - 16) as cell
barcodes where converted with "convertNuclStrToInt32" and converted
back to "convertNuclInt32toString" and this would cause an overflow.

The number of barcodes is still limited to 2^32 after this patch.
UMIs are still limited to 16 nucleotides. Longer UMIs could be
supported in a similar way.

Add support for cell barcodes which are longer than 16 nucleotides.
The old code overwrote the first nucleotides of the cell barcodes
with As for position 1 till (length cell barcode - 16) as cell
barcodes where converted with "convertNuclStrToInt32" and converted
back to "convertNuclInt32toString" and this would cause an overflow.

The number of barcodes is still limited to 2^32 after this patch.
UMIs are still limited to 16 nucleotides. Longer UMIs could be
supported in a similar way.
@ghuls ghuls reopened this Apr 5, 2019
@ghuls
Copy link
Author

ghuls commented Apr 5, 2019

Rebased to avoid the whitespace issues which were fixed recently.

@alexdobin
Copy link
Owner

Hi Gert,

I just released the branch I was working on with new features for STARsolo:
https://github.com/alexdobin/STAR/tree/solo_GeneFull

It has an option of working without the whitelist, which might be useful for custom barcodes.
Error correction is not done in this case, and all distinct CBs are accepted.

I will now work on pulling in your requests.

Cheers
Alex

@alexdobin alexdobin changed the base branch from master to solo_GeneFull April 5, 2019 22:22
@ghuls
Copy link
Author

ghuls commented Apr 6, 2019

Hi Alex,

What is new in that branch? Is is only that a white list is not needed anymore, or are there other changes?

It would also be useful for us if #592 is merged (mostly for normal STAR with ATAC data).

We had a new run with a custom design with barcodes of 20bp + 8 bp UMI, but this time other samples (not single-cell) were added which required a longer read 2, so the number of cycles was higher than needed for BC + UMI for read 2 (more based than 20 + 8 + 28). So it would be nice in the future if there was a way to specify where the barcode an UMI are located. Maybe a syntax similar to Illumina bc2fastq might be an option, instead of needing all those long options:

        --soloCBstart 1 \
        --soloCBlen 20 \
        --soloUMIstart 21 \
        --soloUMIlen 8 \

R2:B20U8n4 ==> Read2: barcode of 20 bp, UMI of 8 bp, ignore 4 bp. If this scheme is flexible enough, the barcode could even be in read one. Also reversing of UMI and BC would be possible, or having a spacer between BC and UMI.

EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 32 not equal to expected 28
Read ID=@NB501171:362:H3NKYBGXB:1:11101:12429:1046 1 N 0   Sequence=CGCACACCAGAATATAAATAAAGCACCAGCTT
SOLUTION: make sure that the barcode read is the second in --readFilesIn and check that is has the correct formatting
          If UMI+CB length is not equal to the barcode read length, specify barcode read length with --soloBarcodeReadLength

Specifying --soloBarcodeReadLength 28 prints the same error.

@alexdobin
Copy link
Owner

Hi Gert,

I have pulled in this request into the new branch, and verified that it works for 16b CB in my test runs.
Would be great if you could test it with your data.

Another major feature in this branch is the ability to count all reads overlapping a gene - i.e. include the reads coming from pre-mRNA, in addition to the standard counting of reads that agree with the spliced transcripts. This is activated with
--soloFeatures Gene GeneFull

It's a good idea to have a string encoding the CB and UMI position, also useful for other technologies like inDrop or sci-RNA-seq or Split-seq.
In your case, you can specify
--soloBarcodeReadLength 0
and STAR will not check whether the read 2 length matches the UMI+CB.

I will pull in the #592 shortly.

Cheers
Alex

@alexdobin alexdobin merged commit 82e470d into alexdobin:solo_GeneFull Apr 8, 2019
@ghuls
Copy link
Author

ghuls commented Apr 9, 2019

@alexdobin Thanks for merging. I think your no whitelist patch is not completely compatible with this branch. It works find as far as I can tell when a whilelist is provided.

# With whitelist specified:
$ tail barcodes.tsv
TTTTTTCCATTTCTACCGGT
TTTTTTCCATTTCTCCCTTC
TTTTTTCCATTTGAATCTGG
TTTTTTCCATTTGGGCTCCC
TTTTTTCCATTTGTACCCAC
TTTTTTCCATTTTATGGAGC
TTTTTTCCATTTTCCTGCGT
TTTTTTCCATTTTCGTCTTC
TTTTTTCCATTTTGGGGGGC
TTTTTTCCATTTTTGCGCAC

# With whitelist None.
$ tail barcodes.tsv
AAAATTTTTTTTTGTTTGAC
AAAATTTTTTTTTTAGCACG
AAAATTTTTTTTTTCCCAAC
AAAATTTTTTTTTTCGCCTG
AAAATTTTTTTTTTCGCCTT
AAAATTTTTTTTTTGAGAAC
AAAATTTTTTTTTTGCGCAC
AAAATTTTTTTTTTTACTTC
AAAATTTTTTTTTTTCCGGT
AAAATTTTTTTTTTTGCTCC

I guess the problem comes from this code in SoloReadBarcode_getCBandUMI.cpp as it assigns cbB (uint64) to cbI (int32):

   if (pSolo.cbWL.size()==0) {//no whitelist - no search
       if (posN!=-1) {//Ns are present, discard this read
           stats.V[stats.nNinBarcode]++;
       } else {//no Ns
           cbI=cbB;//all possible barcodes are accepted
           cbMatch=0;            
       };
       return;
   };

@alexdobin
Copy link
Owner

Hi Gert,

please check the latest commit on this branch, it should solve the problem.

Thanks a lot!
Alex

@ghuls
Copy link
Author

ghuls commented Apr 12, 2019

I get a segmentation fault:

DAT__387aae.STARsolo_hg19_no_whitelist_no_AAAA.pbs: line 13:  5312 Segmentation fault      (core dumped
) /staging/leuven/stg_00002/lcb/ghuls/software/STAR/source/STAR runMode alignReads --soloType Droplet -
-readFilesCommand zcat --readFilesIn "${fastq_R1_filename}" "${fastq_R23_filename}" --genomeDir "${geno
me_dir}" --soloCBwhitelist None --soloBarcodeReadLength 32 --soloCBstart 1 --soloCBlen 20 --soloUMIstar
t 21 --soloUMIlen 8 --alignIntronMax 1 --sjdbGTFfile /staging/leuven/stg_00002/lcb/ghuls/ngs_runs/NextS
eq500_20190228/MCF-7_ATAC_peaks.gff --sjdbGTFfeatureExon Region --sjdbGTFtagExonParentTranscript region
_id --sjdbGTFtagExonParentGene region_name --runThreadN 24 --outSAMtype BAM SortedByCoordinate --outRea
dsUnmapped Fastx --outSAMattributes NH HI AS nM CR CY UR UY --outFileNamePrefix "${output_prefix}" 

In dmesg:

STAR[5312] general protection ip:4258e4 sp:7ffd9c1a7ab0 error:0 in STAR[400000+12c000]

I had also an segmentation fault when I tried to fix this bug myself before.

@alexdobin
Copy link
Owner

Hi Gert,

Thanks a lot for testing it!
I cannot reproduce this problem with your parameters on a mock 20b barcode dataset (I jsut added ACGT to the normal 10X 16b barcode).
Could you try it on a small subset of reads (e.g. using --readMapNumber 1000000) and send it to me, along with the Log.out file and the GFF file.

Cheers
Alex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants