Size for sjdbOverhang for multiple read length datasets #931

Closed
santiagorevale opened this issue Jun 5, 2020 · 6 comments

Comments

@santiagorevale

Hi Alex,

I was wondering which value I should specify for the --sjdbOverhang parameter when creating the reference if I want to use the same reference index with 2x75, 2x101, and 2x151 bp datasets.

The documentation says to use max(ReadLength)-1, which in my case is 150, but when I try to run the alignment for one of the shorter read length datasets, I get the error:

Jun 05 11:28:04 ..... started STAR run
Jun 05 11:28:04 ..... loading genome

EXITING because of fatal PARAMETERS error: present --sjdbOverhang=100 is not equal to the value at the genome generation step =150
SOLUTION: 

Jun 05 11:28:04 ...... FATAL ERROR, exiting

I'm creating the reference with the following command:

STAR \
    --runMode genomeGenerate \
    --genomeDir STARindex \
    --genomeFastaFiles genome.fa \
    --sjdbGTFfile genes.gtf \
    --sjdbOverhang 150 \
    --runThreadN 24

and I'm running the aligner with the following command:

STAR \
    --genomeDir STARindex \
    --runThreadN 8 \
    --readFilesIn Sample01_S1_L001_R1_001.fastq.gz Sample01_S1_L001_R2_001.fastq.gz  \
    --readFilesCommand "gunzip -c" \
    --sjdbOverhang 100 \
    --outSAMtype BAM SortedByCoordinate \
    --twopassMode Basic \
    --limitBAMsortRAM 112000000000 \
    --limitOutSJcollapsed 1000000 \
    --outFileNamePrefix Sample01.

Software version is 2.7.3a on a CentOS 7.6.1810 OS.

Thank you very much in advance.

Cheers,
Santiago
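
The error above compares the --sjdbOverhang given at mapping time with the value recorded when the index was built. As a rough sketch, the recorded value can be read back from the index directory. This assumes the index contains a genomeParameters.txt file with a tab-separated sjdbOverhang line; index_overhang is a hypothetical helper name, not part of STAR.

```shell
#!/bin/sh
# index_overhang DIR
# Hypothetical helper: prints the sjdbOverhang value stored in
# DIR/genomeParameters.txt (assumed to hold "name<TAB>value" lines).
index_overhang() {
    awk '$1 == "sjdbOverhang" { print $2 }' "$1/genomeParameters.txt"
}

# Report the value for the index from the post, if it exists locally.
if [ -f STARindex/genomeParameters.txt ]; then
    echo "STARindex was built with sjdbOverhang=$(index_overhang STARindex)"
fi
```

For the index built with the genomeGenerate command above, this would be expected to report 150, the value the error message refers to.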

@alexdobin
Owner

Hi Santiago,

at the mapping step, you do not need to specify --sjdbOverhang 100 if annotations were already included at the genome generation step.

Cheers
Alex
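
For illustration, here is the mapping command from the original post with the --sjdbOverhang flag dropped, as suggested in the reply above. All other options are copied unchanged from the post. The run wrapper is a hypothetical dry-run helper added only so the command can be inspected without STAR installed; set RUN=1 to execute it.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command unless RUN=1 is set.
# Note that the printed form drops shell quoting (e.g. around "gunzip -c").
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Mapping step without --sjdbOverhang: the index already carries the
# value chosen at genome generation, so it is not repeated here.
map_sample() {
    run STAR \
        --genomeDir STARindex \
        --runThreadN 8 \
        --readFilesIn Sample01_S1_L001_R1_001.fastq.gz Sample01_S1_L001_R2_001.fastq.gz \
        --readFilesCommand "gunzip -c" \
        --outSAMtype BAM SortedByCoordinate \
        --twopassMode Basic \
        --limitBAMsortRAM 112000000000 \
        --limitOutSJcollapsed 1000000 \
        --outFileNamePrefix Sample01.
}

map_sample
```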

@santiagorevale
Author

Hi Alex,

Thanks! So, for the current scenario where I'll be using 75, 100, and 150 bp paired-end reads, should I build the reference using --sjdbOverhang 150, and will it work fine for all of them, as if I had created a specific reference with an --sjdbOverhang of 75, 100, and 150 for each of them?

Thanks again!

Cheers,
Santiago

@alexdobin
Owner

Hi Santiago,

one fixed value of --sjdbOverhang will not work the same as a specific reference for each read length. The latter option, strictly speaking, is more accurate, but the actual effect is marginal. Nowadays I would recommend using 74 for samples with read lengths of 75, 100, and 150.

Cheers
Alex
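
A single shared index with the value recommended above could be generated roughly as follows. This is only a sketch: it reuses the genomeGenerate options from the original post and changes just the overhang. build_index and run are hypothetical helper names, and the command is printed rather than executed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# build_index OVERHANG
# Hypothetical wrapper around the genome generation command from the
# original post, with --sjdbOverhang taken from the first argument.
build_index() {
    overhang=$1
    run STAR \
        --runMode genomeGenerate \
        --genomeDir STARindex \
        --genomeFastaFiles genome.fa \
        --sjdbGTFfile genes.gtf \
        --sjdbOverhang "$overhang" \
        --runThreadN 24
}

# One index for the 75, 100, and 150 bp datasets, per the reply above.
build_index 74
```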

@santiagorevale
Author

Thanks for the clarification!

Cheers,
Santiago

@zhangpeng1202

Hi, Alex,

I noticed that the STAR manual says: "In case of reads of varying length, the ideal value is max(ReadLength)-1."
If I understand correctly, for data including 75, 100, and 150 read lengths, the ideal --sjdbOverhang should be set to max(75, 100, 150)-1, which is 149?

Another quick question: I realize that RNA-seq read lengths can sometimes be 76, 101, or 151. Which --sjdbOverhang should we use when generating the genome index? Should we generate genome indexes for 75, 76, 100, 101, 150, and 151 separately, using 74, 75, 99, 100, 149, and 150 as --sjdbOverhang, respectively?

Thanks very much!
Peng
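
The max(ReadLength)-1 arithmetic quoted from the manual can be checked with a small shell function. This is a minimal sketch; max_overhang is a hypothetical helper name, and it only reproduces the manual's formula, not the replies' practical recommendations.

```shell
#!/bin/sh
# max_overhang LEN...
# Hypothetical helper: prints max(ReadLength) - 1 for the given read lengths.
max_overhang() {
    [ "$#" -gt 0 ] || return 1
    max=0
    for len in "$@"; do
        if [ "$len" -gt "$max" ]; then
            max=$len
        fi
    done
    echo $((max - 1))
}

max_overhang 75 100 150   # prints 149
max_overhang 76 101 151   # prints 150
```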

@alexdobin
Owner

Hi Peng,

using the "non-ideal" value of --sjdbOverhang leads only to marginal changes in most cases, so you could simply use the default value of 100.
