Size for sjdbOverhang for multiple read length datasets #931

santiagorevale · 2020-06-05T10:41:50Z

Hi Alex,

I was wondering which value I should specify for the --sjdbOverhang parameter when creating the reference, if I want to use the same reference index for 2x75, 2x101, and 2x151 bp datasets.

According to the documentation, the value should be max(ReadLength)-1, which in my case is 150, but when I try to run the alignment for one of the shorter read length datasets, I get the error:

Jun 05 11:28:04 ..... started STAR run
Jun 05 11:28:04 ..... loading genome

EXITING because of fatal PARAMETERS error: present --sjdbOverhang=100 is not equal to the value at the genome generation step =150
SOLUTION: 

Jun 05 11:28:04 ...... FATAL ERROR, exiting

I'm creating the reference with the following command:

STAR \
    --runMode genomeGenerate \
    --genomeDir STARindex \
    --genomeFastaFiles genome.fa \
    --sjdbGTFfile genes.gtf \
    --sjdbOverhang 150 \
    --runThreadN 24

and I'm running the aligner with the following command:

STAR \
    --genomeDir STARindex \
    --runThreadN 8 \
    --readFilesIn Sample01_S1_L001_R1_001.fastq.gz Sample01_S1_L001_R2_001.fastq.gz  \
    --readFilesCommand "gunzip -c" \
    --sjdbOverhang 100 \
    --outSAMtype BAM SortedByCoordinate \
    --twopassMode Basic \
    --limitBAMsortRAM 112000000000 \
    --limitOutSJcollapsed 1000000 \
    --outFileNamePrefix Sample01.

Software version is 2.7.3a on a CentOS 7.6.1810 OS.

Thank you very much in advance.

Cheers,
Santiago

alexdobin · 2020-06-07T18:51:59Z

Hi Santiago,

At the mapping step, you do not need to specify --sjdbOverhang 100 if annotations were already included at the genome generation step.

Cheers
Alex
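
For reference, the mapping command from the question with --sjdbOverhang removed could look like the sketch below. All paths and options are copied from the original post; the run_star_mapping wrapper and the DRY_RUN variable are hypothetical helpers added here so the command can be printed without actually launching STAR.

```shell
# Sketch only: mapping against an index that was generated with
# --sjdbGTFfile and --sjdbOverhang, so the overhang is not repeated here.
# run_star_mapping and DRY_RUN are hypothetical names, not part of STAR.
run_star_mapping() {
    # With DRY_RUN set, the command line is echoed instead of executed.
    ${DRY_RUN:+echo} STAR \
        --genomeDir STARindex \
        --runThreadN 8 \
        --readFilesIn Sample01_S1_L001_R1_001.fastq.gz Sample01_S1_L001_R2_001.fastq.gz \
        --readFilesCommand "gunzip -c" \
        --outSAMtype BAM SortedByCoordinate \
        --twopassMode Basic \
        --limitBAMsortRAM 112000000000 \
        --limitOutSJcollapsed 1000000 \
        --outFileNamePrefix Sample01.
}

# Print the command for inspection.
DRY_RUN=1 run_star_mapping
```

Calling run_star_mapping in a shell where DRY_RUN is not set would launch STAR itself.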

santiagorevale · 2020-06-08T09:20:10Z

Hi Alex,

Thanks! So, for the current scenario, where I'll be using 75, 100, and 150 bp paired-end reads, should I build the reference with --sjdbOverhang 150, and will it work fine for all of them? That is, will it behave as if I had created a specific reference for each of them, with --sjdbOverhang chosen for 75, 100, and 150 respectively?

Thanks again!

Cheers,
Santiago

alexdobin · 2020-06-08T14:42:45Z

Hi Santiago,

One fixed value of --sjdbOverhang will not work exactly the same as a specific reference for each read length. The latter option is, strictly speaking, more accurate, but the actual effect is marginal. Nowadays I would recommend using 74 for samples with 75, 100, and 150 read lengths.

Cheers
Alex
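
The two options discussed above (one shared index versus one index per read length) might be sketched as follows. The genome.fa and genes.gtf names come from the original post; the build_index helper, the DRY_RUN variable, and the STARindex_* directory names are assumptions made for illustration only.

```shell
# Sketch only: build_index, DRY_RUN and the directory names are hypothetical.
build_index() {
    overhang=$1     # value passed to --sjdbOverhang
    index_dir=$2    # output directory for the index

    # With DRY_RUN set, commands are echoed instead of executed.
    ${DRY_RUN:+echo} mkdir -p "$index_dir"
    ${DRY_RUN:+echo} STAR \
        --runMode genomeGenerate \
        --genomeDir "$index_dir" \
        --genomeFastaFiles genome.fa \
        --sjdbGTFfile genes.gtf \
        --sjdbOverhang "$overhang" \
        --runThreadN 24
}

# Option 1: a single shared index with the value suggested in the reply.
DRY_RUN=1 build_index 74 STARindex_shared

# Option 2: one index per read length, each with ReadLength-1.
for read_length in 75 100 150; do
    DRY_RUN=1 build_index "$((read_length - 1))" "STARindex_${read_length}bp"
done
```

Option 2 costs one genome generation run and one index on disk per read length, which is the trade-off against the marginal accuracy gain mentioned in the reply.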

santiagorevale · 2020-06-08T15:11:47Z

Thanks for the clarification!

Cheers,
Santiago

zhangpeng1202 · 2022-08-14T02:09:51Z

Hi, Alex,

I noticed that the STAR manual says: "In case of reads of varying length, the ideal value is max(ReadLength)-1."
If I understand correctly, for data that include 75, 100, and 150 read lengths, the ideal --sjdbOverhang would be max(75, 100, 150)-1, which is 149?

Another quick question: sometimes the RNA-seq read length is 76, 101, or 151. Which --sjdbOverhang should we use when generating the genome index? Should we generate separate genome indexes for 75, 76, 100, 101, 150, and 151, using 74, 75, 99, 100, 149, and 150 as --sjdbOverhang, respectively?

Thanks very much!
Peng

alexdobin · 2022-08-16T17:10:03Z

Hi Peng,

Using a "non-ideal" value of --sjdbOverhang leads only to marginal changes in most cases, so you could simply use the default value of 100.
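
To make the comparison concrete, the sketch below lists the "ideal" ReadLength-1 value next to a fixed index value for the read lengths mentioned in the question, and reads back the value an existing index was built with. It assumes that the index directory contains a genomeParameters.txt file with a line of the form "sjdbOverhang <value>"; index_overhang and compare_overhang are hypothetical helper names.

```shell
# Sketch only: helper names are hypothetical, and the layout of
# genomeParameters.txt (a "sjdbOverhang <value>" line) is an assumption.
index_overhang() {
    # $1 = STAR index directory; prints the recorded sjdbOverhang value.
    awk '$1 == "sjdbOverhang" { print $2 }' "$1/genomeParameters.txt"
}

compare_overhang() {
    # $1 = overhang used by the index, remaining arguments = read lengths.
    fixed=$1
    shift
    for read_length in "$@"; do
        echo "read length ${read_length}: ideal $((read_length - 1)), index uses ${fixed}"
    done
}

# Read lengths from the question against the default value of 100.
compare_overhang 100 75 76 100 101 150 151

# Check an existing index, if one is present in the working directory.
if [ -f STARindex/genomeParameters.txt ]; then
    index_overhang STARindex
fi
```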

alexdobin added the issue: usage label Jun 7, 2020

santiagorevale closed this as completed Jun 8, 2020

hsun3163 mentioned this issue Jul 11, 2022

Handle the sjdbOverhang parameter to accommodate the different read length per batch cumc/xqtl-protocol#318

Closed
