
dhtt/SpliceView

Introduction

zbi/spliceview is a bioinformatics best-practice analysis pipeline for extracting alignment information from genomic regions for the inspection of splicing events.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.

Pipeline summary

  1. Read QC (FastQC)
  2. Present QC for raw reads (MultiQC)
  3. Perform adapter/quality trimming on sequencing reads (https://cutadapt.readthedocs.io/en/stable/#)
  4. Create genome index for STAR alignment (https://github.com/alexdobin/STAR)
  5. Align reads using reference genome index (https://github.com/alexdobin/STAR)
  6. Generate alignment files in .bam format and their indexes in .bam.bai format for IGV

Quick Start

Step 1

Install Nextflow (>=22.10.1)
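To check an installed version against the 22.10.1 minimum, a small comparison helper can be used. This is only a sketch: `version_ge` is a hypothetical name, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available, e.g. GNU coreutils.
version_ge() {
    # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge 23.04.0 22.10.1 && echo "Nextflow is recent enough"`. The installed version string has to be supplied by hand, for instance as reported by `nextflow -v`.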

Step 2

Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. You can use Conda both to install Nextflow itself and to manage software within pipelines, but please use it within pipelines only as a last resort (see docs).

Step 3

Download the pipeline and test it on a minimal dataset with a single command (max CPUs = 20, max memory = 128.GB):

nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER> -profile test,docker  --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>

or with minimum resources (max CPUs = 2, max memory = 8.GB)

nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER> -profile test_minimum,docker --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>

Step 4

Start running your own analysis!

I. Check for pipeline requirements

1. Working directory setup

🏠 /home/max_mustermann/.........................home directory
┣ 📦 SpliceView..................................pipeline directory
┣ 📦 TEST........................................working directory
┃  ┣ 🗂️ GENOMES..................................folder containing indexed genomes or reference genome FASTA/GTF files
┃  ┃ ┗ 🗂️ mm10...................................mouse reference genome
┃  ┃ ┃ ┗ 🗂️ star.................................mouse genome index
┃  ┃ ┗ 🗂️ GRCh38.................................human reference genome
┃  ┃ ┃ ┣ 📄 genome.fastq.gz......................FASTA file
┃  ┃ ┃ ┗ 📄 genome.gtf.gz........................GTF file
┃  ┃ ┃
┃  ┣ 🗂️ INPUT....................................input folder containing all datasets
┃  ┃ ┗ 🗂️ testdata_1.............................input directory with reads in fastq.gz format
┃  ┃ ┃ ┣ 📄 test1_1.fastq.gz
┃  ┃ ┃ ┗ 📄 test1_2.fastq.gz
┃  ┃ ┗ 🗂️ testdata_2.............................input directory for another dataset
┃  ┃ ┃ ┣ 📄 test2_1.fastq.gz
┃  ┃ ┃ ┗ 📄 test2_2.fastq.gz
┃  ┃ ┃
┃  ┣ 🗂️ OUTPUT...................................output folder for all datasets
┃  ┃ ┗ 🗂️ testdata_1.............................output directory for testdata_1
┃  ┃ ┃ ┣ 🗂️ cutadapt.............................Cutadapt output
┃  ┃ ┃ ┣ 🗂️ fastqc...............................FastQC output
┃  ┃ ┃ ┣ 🗂️ genomes..............................genome index-related output
┃  ┃ ┃ ┃ ┗ 🗂️ mm10
┃  ┃ ┃ ┃ ┃ ┗ 🗂️ star.............................generated genome index
┃  ┃ ┃ ┣ 🗂️ multiqc..............................MultiQC output
┃  ┃ ┃ ┃ ┣ 🗂️ multiqc_data
┃  ┃ ┃ ┃ ┗ 📄 multiqc_report.html................MultiQC report
┃  ┃ ┃ ┣ 🗂️ pipeline_info........................additional information about the run
┃  ┃ ┃ ┣ 🗂️ star_align_log.......................STAR alignment logs
┃  ┃ ┃ ┗ 🗂️ star_align_result....................STAR alignment output
┃  ┃ ┃ ┃ ┣  📄 test1_T1.Aligned.sortedByCoord.out.bam
┃  ┃ ┃ ┃ ┗  📄 test1_T1.Aligned.sortedByCoord.out.bam.bai

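
The top-level folders of the layout above can be created up front. The sketch below is one way to do it; `make_workdir` is a hypothetical helper, not part of the pipeline, and the dataset name is whatever you choose.

```shell
# Hypothetical helper: create the working-directory skeleton shown above.
# $1 = working directory (e.g. /home/max_mustermann/TEST), $2 = dataset name.
make_workdir() {
    base="$1"
    dataset="$2"
    # GENOMES holds reference files and reusable indexes;
    # INPUT and OUTPUT each get one sub-folder per dataset.
    mkdir -p "$base/GENOMES" "$base/INPUT/$dataset" "$base/OUTPUT/$dataset"
}
```

For example, `make_workdir /home/max_mustermann/TEST testdata_1`; the fastq.gz files then go into `INPUT/testdata_1`.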
2. Mandatory arguments

--input

  • The full path to the folder where fastq-files are stored.
    Example: With the above folder structure, --input is /home/max_mustermann/TEST/INPUT/testdata_1
  • All fastq-files should be compressed and end with .gz. You can compress a .fastq file with the command gzip file.fastq.
  • For paired-end fastq-files in the --input folder, forward reads must end with _1.fastq.gz and reverse reads must end with _2.fastq.gz.
    Example: For a paired-end sample WT with 2 replicates, the files should be named:
    • Replicate 1: WT_Rep1_1.fastq.gz and WT_Rep1_2.fastq.gz
    • Replicate 2: WT_Rep2_1.fastq.gz and WT_Rep2_2.fastq.gz
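
Before starting a paired-end run, the naming rule above can be checked with a short shell function. This is a minimal sketch: `check_pairs` is a hypothetical helper, and it only applies to paired-end data.

```shell
# Hypothetical helper: report forward reads (_1.fastq.gz) in folder $1
# that have no matching reverse read (_2.fastq.gz).
# Returns non-zero if at least one mate is missing.
check_pairs() {
    dir="$1"
    status=0
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue            # the glob matched nothing
        rev="${fwd%_1.fastq.gz}_2.fastq.gz"  # swap the suffix to get the mate's name
        if [ ! -e "$rev" ]; then
            echo "missing mate for $(basename "$fwd")"
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_pairs /home/max_mustermann/TEST/INPUT/testdata_1` prints nothing and succeeds when every forward file has its mate.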

--outdir

  • The full path to the folder where all outputs and logs are stored.
    Example: With the above folder structure, --outdir is /home/max_mustermann/TEST/OUTPUT/testdata_1

--genome

  • The reference genome used for STAR genome indexing and STAR alignment.
    Example: The reference genome options are GRCh38 or GRCh37 for human, and mm10 for mouse.

Note Defining --genome will download and use the reference genome from the iGenomes database. If you wish to use an existing copy of the reference genome, please define --fasta and --gtf and do not include --genome in the command line. See here and here

--fasta

  • The reference genome FASTA file used for STAR genome indexing and STAR alignment UNLESS --genome is defined (see --genome).
    Example: The path to FASTA file in the above folder structure is /home/max_mustermann/TEST/GENOMES/GRCh38/genome.fastq.gz

Note --fasta must be declared together with --gtf

--gtf

  • The reference genome GTF file used for STAR genome indexing and STAR alignment UNLESS --genome is defined (see --genome).
    Example: The path to GTF file in the above folder structure is /home/max_mustermann/TEST/GENOMES/GRCh38/genome.gtf.gz

Note --gtf must be declared together with --fasta. See --fasta
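
The rules for --genome, --fasta and --gtf can be summarised as a small pre-flight check. The sketch below is illustrative only: `check_reference_args` is a hypothetical helper and is not how the pipeline itself validates its parameters.

```shell
# Hypothetical helper: check the reference options before launching a run.
# $1 = value for --genome, $2 = value for --fasta, $3 = value for --gtf
# (pass an empty string for an option you do not intend to use).
check_reference_args() {
    genome="$1"
    fasta="$2"
    gtf="$3"
    if [ -n "$genome" ]; then
        # --genome excludes user-supplied reference files.
        if [ -n "$fasta" ] || [ -n "$gtf" ]; then
            echo "use either --genome or --fasta/--gtf, not both" >&2
            return 1
        fi
        return 0
    fi
    # Without --genome, --fasta and --gtf must be given together.
    if [ -z "$fasta" ] || [ -z "$gtf" ]; then
        echo "--fasta and --gtf must be declared together" >&2
        return 1
    fi
    return 0
}
```

For example, `check_reference_args "" genome.fastq.gz genome.gtf.gz` succeeds, while `check_reference_args mm10 genome.fastq.gz ""` fails.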

-profile

  • Use docker as default, unless Singularity or other \

Warning There is only one hyphen (-) in front of this parameter, while all others require two hyphens (--)

3. Optional arguments

--star_index

  • Path to the folder containing a prebuilt/generated genome index. This parameter can be used when a specific genome index has been created successfully from a previous run.
  • Using --star_index speeds up the process significantly, as the genome indexing step requires extensive time and memory (for the test data, --star_index can reduce the run time from 1 hour to 5 minutes).

Example: The path to the genome index in the above folder structure is /home/max_mustermann/TEST/GENOMES/mm10/star. This genome index was generated by a previous run using the 'mm10' mouse reference genome and was initially stored in /home/max_mustermann/TEST/OUTPUT/testdata_1/genomes/mm10/star

Note It is highly recommended to copy the genome index to a folder such as /home/max_mustermann/TEST/GENOMES/ once a run has generated it successfully, so that it can be reused.

--genome must be defined when --star_index is used
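
Copying a freshly generated index into the shared GENOMES folder could look like the sketch below. `save_star_index` is a hypothetical helper; the paths follow the folder structure shown earlier.

```shell
# Hypothetical helper: copy a generated STAR index out of a run's output folder.
# $1 = run output folder, $2 = genome name, $3 = shared genomes folder.
save_star_index() {
    src="$1/genomes/$2/star"
    dest="$3/$2"
    if [ ! -d "$src" ]; then
        echo "no index found at $src" >&2
        return 1
    fi
    mkdir -p "$dest"
    cp -R "$src" "$dest/"   # the index ends up in $dest/star
    echo "$dest/star"       # print the path to pass to --star_index
}
```

For example, `save_star_index /home/max_mustermann/TEST/OUTPUT/testdata_1 mm10 /home/max_mustermann/TEST/GENOMES` prints /home/max_mustermann/TEST/GENOMES/mm10/star.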

--extra_star_align_args

  • Extra arguments to pass to STAR alignment that can be found here
    Example: --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --limitGenomeGenerateRAM=124544990592

--fastq_dir_to_samplesheet_args

  • Extra arguments to pass to fastq_dir_to_samplesheet.py to prepare samplesheet for Nextflow pipeline that can be found here
    Example: --single_end true --recursive true

--max_cpus

  • Maximum number of CPUs that can be assigned to a process. Default: --max_cpus 2 in test_minimum profile; --max_cpus 20 in test profile

--max_memory

  • Maximum memory that can be assigned to the process. Default: --max_memory 8.GB in test_minimum profile; --max_memory 128.GB in test profile

II. Run pipeline

1. OPTION 1

Download and use FASTA/GTF reference genome files from iGenomes for genome indexing:

nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER>\  # /home/max_mustermann/SpliceView
   --input <ABSOLUTE_PATH_TO_FASTQ_FILES_FOLDER>\   # /home/max_mustermann/TEST/INPUT/testdata_1
   --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>\       # /home/max_mustermann/TEST/OUTPUT/testdata_1
   --genome <NAME_OF_REFERENCE_GENOME>\             # mm10
   -profile docker

Warning

Make sure there is no whitespace after the backslash ( \ ) at the end of each line, and remove the comments (#comment) before running the command.

Note --genome must be defined to download the reference genome from the iGenomes database

2. OPTION 2

Use self-defined/existing FASTA/GTF reference genome files for genome indexing:

nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER>\  # /home/max_mustermann/SpliceView
   --input <ABSOLUTE_PATH_TO_FASTQ_FILES_FOLDER>\   # /home/max_mustermann/TEST/INPUT/testdata_1
   --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>\       # /home/max_mustermann/TEST/OUTPUT/testdata_1
   --fasta <ABSOLUTE_PATH_TO_FASTA_FILE>\           # /home/max_mustermann/TEST/GENOMES/GRCh38/genome.fastq.gz
   --gtf <ABSOLUTE_PATH_TO_GTF_FILE>\               # /home/max_mustermann/TEST/GENOMES/GRCh38/genome.gtf.gz
   -profile docker

Warning

Make sure there is no whitespace after the backslash ( \ ) at the end of each line, and remove the comments (#comment) before running the command.

Note --fasta and --gtf must both be defined, and --genome must not be provided

3. OPTION 3

Use a previously generated genome index and skip STAR indexing (less time-consuming):

nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER>\     # /home/max_mustermann/SpliceView
   --input <ABSOLUTE_PATH_TO_FASTQ_FILES_FOLDER>\      # /home/max_mustermann/TEST/INPUT/testdata_1
   --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>\          # /home/max_mustermann/TEST/OUTPUT/testdata_1
   --genome <NAME_OF_REFERENCE_GENOME>\                # mm10
   --star_index <ABSOLUTE_PATH_TO_STAR_INDEX_FOLDER>\  # /home/max_mustermann/TEST/GENOMES/mm10/star
   -profile docker

Warning

Make sure there is no whitespace after the backslash ( \ ) at the end of each line, and remove the comments (#comment) before running the command.

Note --genome must be defined when --star_index is used

III. PIPELINE RESULTS

The outputs include the following folders:

  • cutadapt: Cutadapt output, including trimmed reads and reports
  • fastqc: FastQC output for generated reads
  • genomes: Reference genome indexed by STAR, which can be reused for another run with different datasets. The index is stored in the genomes/<NAME_OF_GENOME>/star folder
  • multiqc: MultiQC final report, stored in .html format
  • pipeline_info: Additional information about the current run
  • star_align_log: Additional information about the STAR alignment
  • star_align_result: Main results of the pipeline, stored in .BAM and .BAI format

IV. TROUBLESHOOTING

⭐️ The pipeline can be re-run with modified parameters while resuming a previous run, which reduces the runtime of the new run. Simply follow these steps:

  1. Run the following command to get a report of recent runs
nextflow log
OUTPUT:
TIMESTAMP           DURATION    RUN NAME            STATUS  REVISION ID SESSION ID                              COMMAND                   
2023-06-14 12:20:47     -               furious_mcnulty         -       e1508873c9      8855cf37-826f-4a31-b960-e44d3a881954    nextflow run ./SpliceView --outdir ./TEST/OUTPUT/est_data1 -profile docker,test
2023-06-14 12:24:22     2m 45s          desperate_blackwell     OK      68da96b1f7      94f82489-c6a5-41ac-a6a7-9058436b1089    nextflow run ./SpliceView --outdir ./TEST/OUTPUT/est_data1 -profile docker,test
  2. Add the -resume argument to the new command line with the session ID of the run you wish to resume.
    Example:
nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER> -profile test,docker  --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER> -resume 8855cf37-826f-4a31-b960-e44d3a881954
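
Instead of copying the session ID by hand, it can be pulled out of the log text. The sketch below assumes the log layout shown in the example output above (run name and a UUID-style session ID on the same line); `session_id_of` is a hypothetical helper, and it relies on a `grep` that supports `-o`.

```shell
# Hypothetical helper: read `nextflow log` output on stdin and print the
# session ID found on the line that mentions run name $1.
session_id_of() {
    grep -F -- "$1" \
        | grep -Eo '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
        | head -n 1
}
```

For example, `nextflow log | session_id_of furious_mcnulty` would print the session ID of that run, ready to be passed to -resume.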

⭐️ Depending on the available resources, the maximum number of CPUs and the maximum memory can be adjusted. By default, the test profile uses 20 CPUs and 128 GB of memory, while test_minimum uses 2 CPUs and 8 GB of memory. Users can adjust these limits by adding the --max_cpus and --max_memory arguments to the command line.
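
One way to choose these values is to cap the desired limits at what the machine actually offers. The sketch below only builds the two flags; `smaller_of` and `limit_flags` are hypothetical helpers, and the available CPU count and memory have to be supplied by the user.

```shell
# Hypothetical helper: print the smaller of two whole numbers.
smaller_of() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Hypothetical helper: print --max_cpus/--max_memory flags capped at the machine's capacity.
# $1 = wanted CPUs, $2 = wanted memory in GB, $3 = available CPUs, $4 = available memory in GB.
limit_flags() {
    printf '%s %s %s %s.GB\n' \
        --max_cpus "$(smaller_of "$1" "$3")" \
        --max_memory "$(smaller_of "$2" "$4")"
}
```

For example, on a machine with 8 CPUs and 32 GB of memory, `limit_flags 20 128 8 32` prints `--max_cpus 8 --max_memory 32.GB`, which can be appended to the nextflow run command.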

⭐️ Some extra STAR alignment arguments must be adjusted to the available memory for a successful run, for example --limitGenomeGenerateRAM (see issue). To add extra arguments to the STAR alignment, see --extra_star_align_args
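
STAR expects the value of --limitGenomeGenerateRAM in bytes, so a memory budget given in GiB has to be converted first. A minimal sketch, with `gib_to_bytes` as a hypothetical helper and assuming a shell with 64-bit arithmetic:

```shell
# Hypothetical helper: convert a whole number of GiB into bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}
```

For example, `gib_to_bytes 32` prints 34359738368, which can then be used as the value of --limitGenomeGenerateRAM inside --extra_star_align_args.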

Credits

zbi/spliceview was originally written by Trang Do.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
