zbi/spliceview is a bioinformatics best-practice analysis pipeline for extracting alignment information from genomic regions for inspection of splicing events.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.
- Read QC (FastQC)
- Present QC for raw reads (MultiQC)
- Perform adapter/quality trimming on sequencing reads (https://cutadapt.readthedocs.io/en/stable/#)
- Create genome index for STAR alignment (https://github.com/alexdobin/STAR)
- Align reads using reference genome index (https://github.com/alexdobin/STAR)
- Generate alignment files in .bam format and their indexes in .bam.bai format for IGV
Install Nextflow (>=22.10.1)
Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. You can use Conda both to install Nextflow itself and to manage software within pipelines, but please use it within pipelines only as a last resort; see docs.
Download the pipeline and test it on a minimal dataset with a single command (max CPUs = 20, max memory = 128.GB):
nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER> -profile test,docker --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>
or with minimum resources (max CPUs = 2, max memory = 8.GB):
nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER> -profile test_minimum,docker --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>
Start running your own analysis!
/home/max_mustermann/ ............................. home directory
┣ SpliceView ...................................... pipeline directory
┗ TEST ............................................ working directory
  ┣ GENOMES ....................................... folder containing indexed genomes or reference genome FASTA/GTF files
  ┃ ┣ mm10 ........................................ mouse reference genome
  ┃ ┃ ┗ star ...................................... mouse genome index
  ┃ ┗ GRCh38 ...................................... human reference genome
  ┃   ┣ genome.fastq.gz ........................... FASTA file
  ┃   ┗ genome.gtf.gz ............................. GTF file
  ┃
  ┣ INPUT ......................................... input folder containing all datasets
  ┃ ┣ testdata_1 .................................. input directory with reads in fastq.gz format
  ┃ ┃ ┣ test1_1.fastq.gz
  ┃ ┃ ┗ test1_2.fastq.gz
  ┃ ┗ testdata_2 .................................. input directory for another dataset
  ┃   ┣ test2_1.fastq.gz
  ┃   ┗ test2_2.fastq.gz
  ┃
  ┗ OUTPUT ........................................ output folder for all datasets
    ┗ testdata_1 .................................. output directory for testdata_1
      ┣ cutadapt .................................. Cutadapt output
      ┣ fastqc .................................... FastQC output
      ┣ genomes ................................... genome index-related output
      ┃ ┗ mm10
      ┃   ┗ star .................................. generated genome index
      ┣ multiqc ................................... MultiQC output
      ┃ ┣ multiqc_data
      ┃ ┗ multiqc_report.html ..................... MultiQC report
      ┣ pipeline_info ............................. additional information about the run
      ┣ star_align_log ............................ STAR alignment logs
      ┗ star_align_result ......................... STAR alignment output
        ┣ test1_T1.Aligned.sortedByCoord.out.bam
        ┗ test1_T1.Aligned.sortedByCoord.out.bam.bai
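The folder layout above maps directly onto the --input and --outdir values used later in this README. As a minimal sketch (the helper name dataset_paths is hypothetical and not part of the pipeline), the two paths for a dataset could be derived from the working directory like this:

```python
from pathlib import PurePosixPath


def dataset_paths(workdir, dataset):
    """Return the --input and --outdir folders for one dataset.

    Assumes the layout shown above: <workdir>/INPUT/<dataset> holds the
    reads and <workdir>/OUTPUT/<dataset> receives the results.
    """
    workdir = PurePosixPath(workdir)
    return {
        "input": workdir / "INPUT" / dataset,
        "outdir": workdir / "OUTPUT" / dataset,
    }


paths = dataset_paths("/home/max_mustermann/TEST", "testdata_1")
print(paths["input"])   # /home/max_mustermann/TEST/INPUT/testdata_1
print(paths["outdir"])  # /home/max_mustermann/TEST/OUTPUT/testdata_1
```

Keeping one input and one output folder per dataset means a second dataset such as testdata_2 only changes the dataset argument.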
--input
- The full path to the folder where fastq-files are stored.
Example: With the above folder structure, --input is /home/max_mustermann/TEST/INPUT/testdata_1
- All fastq-files should be compressed and end with .gz. You can use the gzip file.fastq command to compress a .fastq file.
- For paired-end fastq-files in the --input folder, forward reads must end with _1.fastq.gz and reverse reads must end with _2.fastq.gz.
Example: For a paired-end sample WT with 2 replicates, the files should be named:
- Replicate 1: WT_Rep1_1.fastq.gz and WT_Rep1_2.fastq.gz
- Replicate 2: WT_Rep2_1.fastq.gz and WT_Rep2_2.fastq.gz
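Before starting a run, it can help to confirm that the files in the --input folder follow the _1.fastq.gz / _2.fastq.gz naming rule. The following is a minimal sketch of such a check; the function pair_fastq_names is hypothetical and is not part of the pipeline or of fastq_dir_to_samplesheet.py.

```python
def pair_fastq_names(file_names):
    """Group paired-end FASTQ file names by sample.

    Returns (pairs, unpaired): pairs maps a sample name to its
    (forward, reverse) file names; unpaired lists samples for which
    only one mate was found. Names without a _1/_2 suffix are ignored.
    """
    forward, reverse = {}, {}
    for name in file_names:
        if name.endswith("_1.fastq.gz"):
            forward[name[: -len("_1.fastq.gz")]] = name
        elif name.endswith("_2.fastq.gz"):
            reverse[name[: -len("_2.fastq.gz")]] = name
    pairs = {s: (forward[s], reverse[s]) for s in sorted(forward) if s in reverse}
    unpaired = sorted(set(forward) ^ set(reverse))
    return pairs, unpaired


pairs, unpaired = pair_fastq_names(
    ["WT_Rep1_1.fastq.gz", "WT_Rep1_2.fastq.gz", "WT_Rep2_1.fastq.gz"]
)
print(pairs)     # {'WT_Rep1': ('WT_Rep1_1.fastq.gz', 'WT_Rep1_2.fastq.gz')}
print(unpaired)  # ['WT_Rep2']
```

To check a real folder, pass the file names from it, for example `pair_fastq_names(p.name for p in pathlib.Path(input_dir).iterdir())`.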
--outdir
- The full path to the folder where all outputs and logs are stored.
Example: With the above folder structure, --outdir is /home/max_mustermann/TEST/OUTPUT/testdata_1
--genome
- The reference genome used for STAR genome indexing and STAR alignment.
Example: The reference genome options for human are GRCh38 or GRCh37, and for mouse mm10.
Note: Defining --genome will download and use the reference genome from the iGenomes database. If you wish to use an existing version of the reference genome, please define --fasta and --gtf and do not include --genome in the command line. See here and here
--fasta
- The reference genome FASTA file used for STAR genome indexing and STAR alignment UNLESS --genome is defined (see --genome).
Example: The path to the FASTA file in the above folder structure is /home/max_mustermann/TEST/GENOMES/GRCh38/genome.fastq.gz
Note: --fasta must be declared together with --gtf
--gtf
- The reference genome GTF file used for STAR genome indexing and STAR alignment UNLESS --genome is defined (see --genome).
Example: The path to the GTF file in the above folder structure is /home/max_mustermann/TEST/GENOMES/GRCh38/genome.gtf.gz
Note: --gtf must be declared together with --fasta. See --fasta
-profile
- Use docker as default, unless Singularity or other \
Warning: There is only one hyphen (-) in front of this parameter, while all others require two hyphens (--).
--star_index
- Path to the folder containing a prebuilt/generated genome index. This parameter can be used when a specific genome index has been created successfully from a previous run.
- Using --star_index speeds up the process significantly, as the genome indexing step requires extensive time and memory (for the test data, --star_index can reduce the run time from 1 hour to 5 minutes).
Example: The path to the genome index in the above folder structure is /home/max_mustermann/TEST/GENOMES/mm10/star. This genome index was generated by a previous run using the 'mm10' mouse reference genome and was initially stored in /home/max_mustermann/TEST/OUTPUT/testdata_1/genomes/mm10/star
Note: It is highly recommended to copy the genome index to a folder such as /home/max_mustermann/TEST/GENOMES/ once it has been generated successfully by a run, so that it can be reused.
--genome must be defined when --star_index is used.
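The rules stated for --genome, --fasta, --gtf and --star_index can be summarised in a small check. This is only a sketch of the rules as written in this README, not the pipeline's own validation code; the function name check_reference_options is hypothetical.

```python
def check_reference_options(genome=None, fasta=None, gtf=None, star_index=None):
    """Return a list of problems with the reference options (empty if none)."""
    problems = []
    # --fasta and --gtf are only valid as a pair.
    if bool(fasta) != bool(gtf):
        problems.append("--fasta and --gtf must be declared together")
    # --genome replaces --fasta/--gtf rather than complementing them.
    if genome and (fasta or gtf):
        problems.append("do not combine --genome with --fasta/--gtf")
    # Some reference must be given.
    if not genome and not fasta and not gtf:
        problems.append("define either --genome or both --fasta and --gtf")
    # A prebuilt index still needs the genome name.
    if star_index and not genome:
        problems.append("--genome must be defined when --star_index is used")
    return problems


print(check_reference_options(genome="mm10", star_index="/home/max_mustermann/TEST/GENOMES/mm10/star"))
print(check_reference_options(fasta="/home/max_mustermann/TEST/GENOMES/GRCh38/genome.fastq.gz"))
```

The first call reports no problems; the second reports that --gtf is missing.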
--extra_star_align_args
- Extra arguments to pass to STAR alignment that can be found here
Example: --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --limitGenomeGenerateRAM=124544990592
--fastq_dir_to_samplesheet_args
- Extra arguments to pass to fastq_dir_to_samplesheet.py to prepare samplesheet for Nextflow pipeline that can be found here
Example: --single_end true --recursive true
--max_cpus
- Maximum number of CPUs that can be assigned to the process. Default: --max_cpus 2 in the test_minimum profile; --max_cpus 20 in the test profile
--max_memory
- Maximum memory that can be assigned to the process. Default: --max_memory 8.GB in the test_minimum profile; --max_memory 128.GB in the test profile
Download and use FASTA/GTF reference genome files from iGenomes for genome indexing:
nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER>\ # /home/max_mustermann/Spliceview
--input <ABSOLUTE_PATH_TO_FASTQ_FILES_FOLDER>\ # /home/max_mustermann/TEST/INPUT/testdata_1
--outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>\ # /home/max_mustermann/TEST/OUTPUT/testdata_1
--genome <NAME_OF_REFERENCE_GENOME>\ # mm10
-profile docker
Warning: Please make sure there is no whitespace after the backslash ( \ ) at the end of each line, and remove the comments (# comment) before running the command.
Note: --genome must be defined to download the reference genome from the iGenomes database.
Use self-defined/existing FASTA/GTF reference genome files for genome indexing:
nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER>\ # /home/max_mustermann/Spliceview
--input <ABSOLUTE_PATH_TO_FASTQ_FILES_FOLDER>\ # /home/max_mustermann/TEST/INPUT/testdata_1
--outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>\ # /home/max_mustermann/TEST/OUTPUT/testdata_1
--fasta <ABSOLUTE_PATH_TO_FASTA_FILE>\ # /home/max_mustermann/TEST/GENOMES/GRCh38/genome.fastq.gz
--gtf <ABSOLUTE_PATH_TO_GTF_FILE>\ # /home/max_mustermann/TEST/GENOMES/GRCh38/genome.gtf.gz
-profile docker
Warning: Please make sure there is no whitespace after the backslash ( \ ) at the end of each line, and remove the comments (# comment) before running the command.
Note: --fasta and --gtf must be defined when --genome is not provided.
Use a previously generated genome index and skip STAR indexing (less time-consuming):
nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER>\ # /home/max_mustermann/Spliceview
--input <ABSOLUTE_PATH_TO_FASTQ_FILES_FOLDER>\ # /home/max_mustermann/TEST/INPUT/testdata_1
--outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER>\ # /home/max_mustermann/TEST/OUTPUT/testdata_1
--genome <NAME_OF_REFERENCE_GENOME>\ # mm10
--star_index <ABSOLUTE_PATH_TO_STAR_INDEX_FOLDER>\ # /home/max_mustermann/TEST/GENOMES/mm10/star
-profile docker
Warning: Please make sure there is no whitespace after the backslash ( \ ) at the end of each line, and remove the comments (# comment) before running the command.
Note: --genome must be defined when --star_index is used.
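One way to avoid the trailing-whitespace and inline-comment pitfalls of multi-line shell commands is to assemble the command as an argument list. The sketch below uses only the options documented in this README; the helper build_command is hypothetical and the paths are the example paths from the folder structure above.

```python
import shlex


def build_command(pipeline_dir, input_dir, outdir, profile="docker", **options):
    """Assemble a `nextflow run` command as an argument list.

    Extra keyword arguments (e.g. genome, fasta, gtf, star_index) become
    double-hyphen pipeline options; -profile keeps its single hyphen.
    """
    cmd = ["nextflow", "run", pipeline_dir, "--input", input_dir, "--outdir", outdir]
    for name, value in options.items():
        if value is not None:
            cmd += [f"--{name}", str(value)]
    cmd += ["-profile", profile]
    return cmd


cmd = build_command(
    "/home/max_mustermann/Spliceview",
    "/home/max_mustermann/TEST/INPUT/testdata_1",
    "/home/max_mustermann/TEST/OUTPUT/testdata_1",
    genome="mm10",
    star_index="/home/max_mustermann/TEST/GENOMES/mm10/star",
)
# Print a copy-pasteable single-line command.
print(shlex.join(cmd))
```

The list can also be passed to `subprocess.run(cmd, check=True)` to launch the pipeline without going through a shell.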
The outputs include the following folders:
- cutadapt: Cutadapt output, including trimmed reads and the report, is stored in this folder.
- fastqc: FastQC output for generated reads
- genomes: Reference genome indexed by STAR, which can be reused for another run with different datasets. The index is stored in the genomes/<NAME_OF_GENOME>/star folder
- multiqc: The final MultiQC report is stored here in .html format
- pipeline_info: Additional information about the current run
- star_align_log: Additional information about the STAR alignment
- star_align_result: The main results of the pipeline are stored here in .BAM and .BAI format
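Since IGV needs both the alignment file and its index, a quick check of the star_align_result folder can confirm that every .bam file has a matching .bam.bai file. This is a minimal sketch; the function bams_missing_index is hypothetical and not part of the pipeline.

```python
def bams_missing_index(file_names):
    """Return the sorted .bam file names that have no matching .bam.bai file."""
    names = set(file_names)
    return sorted(
        name for name in names
        if name.endswith(".bam") and name + ".bai" not in names
    )


result_files = [
    "test1_T1.Aligned.sortedByCoord.out.bam",
    "test1_T1.Aligned.sortedByCoord.out.bam.bai",
]
print(bams_missing_index(result_files))  # []
```

For a real run, the file names can be collected with `os.listdir` on the star_align_result folder inside the --outdir directory.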
The pipeline can be re-run with modified parameters while resuming a previous run, which reduces the runtime of the new run. Simply follow these steps:
- Run the following command to get a report of recent runs:
nextflow log
OUTPUT:
TIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND
2023-06-14 12:20:47 - furious_mcnulty - e1508873c9 8855cf37-826f-4a31-b960-e44d3a881954 nextflow run ./SpliceView --outdir ./TEST/OUTPUT/est_data1 -profile docker,test
2023-06-14 12:24:22 2m 45s desperate_blackwell OK 68da96b1f7 94f82489-c6a5-41ac-a6a7-9058436b1089 nextflow run ./SpliceView --outdir ./TEST/OUTPUT/est_data1 -profile docker,test
- Add the -resume argument to the new command line with the session ID of the run you wish to resume.
Example:
nextflow run <ABSOLUTE_PATH_TO_SPLICEVIEW_FOLDER> -profile test,docker --outdir <ABSOLUTE_PATH_TO_RESULT_FOLDER> -resume 8855cf37-826f-4a31-b960-e44d3a881954
Depending on the user's resources, the maximum number of CPUs and the maximum memory can be adjusted. The test profile defaults to 20 CPUs and 128 GB of memory, while test_minimum uses 2 CPUs and 8 GB of memory. Users can adjust these limits by adding the --max_cpus and --max_memory arguments to the command line.
Some extra STAR alignment arguments must be adjusted depending on the available memory for a successful run, for example --limitGenomeGenerateRAM (see issue). To add extra arguments to STAR alignment, see --extra_star_align_args.
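Looking up the session ID for -resume can be scripted. The sketch below extracts the session ID for a given run name from `nextflow log` output such as the report shown in the resume steps; the function session_id_for is hypothetical, and it assumes only that the session ID is the UUID-shaped field on the line containing the run name.

```python
import re

# Session IDs in the log report are UUID-shaped (8-4-4-4-12 hex digits).
SESSION_ID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def session_id_for(log_text, run_name):
    """Return the session ID on the log line for run_name, or None."""
    for line in log_text.splitlines():
        if run_name in line.split():
            match = SESSION_ID.search(line)
            if match:
                return match.group(0)
    return None


log_text = (
    "2023-06-14 12:20:47 - furious_mcnulty - e1508873c9 "
    "8855cf37-826f-4a31-b960-e44d3a881954 nextflow run ./SpliceView "
    "--outdir ./TEST/OUTPUT/est_data1 -profile docker,test\n"
    "2023-06-14 12:24:22 2m 45s desperate_blackwell OK 68da96b1f7 "
    "94f82489-c6a5-41ac-a6a7-9058436b1089 nextflow run ./SpliceView "
    "--outdir ./TEST/OUTPUT/est_data1 -profile docker,test\n"
)
print(session_id_for(log_text, "furious_mcnulty"))
# 8855cf37-826f-4a31-b960-e44d3a881954
```

The returned value can then be appended after -resume in the new command line.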
zbi/spliceview was originally written by Trang Do.
If you would like to contribute to this pipeline, please see the contributing guidelines.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.