Analysis of circRNA and miRNA sponging.
This pipeline was implemented by Octavia Ciora as part of her Advanced Lab Course Bioinformatics under the supervision of Dr. Markus List. It was extended by Leon Schwartz as part of his Bachelor's Thesis under the supervision of Markus Hoffmann and Dr. Markus List.
nf-core/circrnasponging is a pipeline for the systematical analysis of circRNAs, their differential expression, miRNA sponging activities and ceRNA network functions. It requires total RNA and small RNA sequencing data.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.
By default, the pipeline currently performs the following:
- circRNA analysis:
- totalRNA back-splicing-junction (BSJ) mapping (using
STAR) - circRNA detection (using
circExplorer2) - aggregation of detected circRNAs over all samples, normalization and filtering
- linear and circular quantification (using
psircandkallisto) - extraction of fasta sequences for filtered circRNAs
- database annotation (using
circBase)
- totalRNA back-splicing-junction (BSJ) mapping (using
- differential expression analysis (using
DESeq2):- linear RNAs
- circular RNAs
- Heatmaps, Volcano and PCA plots
- miRNA analysis:
- smallRNA read mapping and quantification (using
miRDeep2) - aggregation of identified miRNAs over all samples, normalization and filtering
- smallRNA read mapping and quantification (using
- circRNA-miRNA binding sites:
- binding sites processing and filtering (
miRanda) TarPmiRPITA- majority vote (2/3) for downstream analyses
- binding sites processing and filtering (
- sponging analyses:
- Correlation analysis between circRNA and miRNA expression (using
miRanda) - mRNA-circRNA-miRNA networks (using majority vote with
SPONGE) - ceRNA network analysis (using
SPONGEpackage extensionspongEffects) - Analysis of networks and plotting
- Correlation analysis between circRNA and miRNA expression (using
This analysis is based on both totalRNA (or rRNA-depleted) and smRNA data coming from the same samples. It is recommended to run this analysis with a minimum of 5 samples. The exact input format and how to get the needed reference files is described below. In order to run the sponging analysis on a dataset using our pipeline, the data has to meet strict requirements. At least 5 samples from the same organism are needed, because computing the correlation between 4 or less samples is unreasonable. For each sample, both total RNA (or rRNA-depleted) and small RNA data should be available. Further instructions regarding the exact input format, reference files and configuration are explained below.
Pull pipeline from github using:
git clone https://github.com/biomedbigdata/circRNA-sponging.git
The following options are mandatory for executing the full workflow of the pipeline:
BASIC OPTIONS:
--samplesheet [path/to/sampleseet.tsv]
--outdir [path/to/output_directory]
--genome [string] # genome version of RNA-seq data, GRCh38 for human, etc.
--transcriptome [string] # path to transcriptome
-profile [configuration_profile] # docker, singularity, slurm, etc.
Run the pipeline using the following command:
nextflow run
The pipeline requires a tab-separated samplesheet file containing the sample names and the paths to the corresponding read files in fastq.gz format. By default, the totalRNA sequencing data is considered to be paired-end. If the total RNA data is single-end, the instructions below should be followed. The samplesheet should be a tab-separated file following the structure:
sample | totalRNA1 | totalRNA2 | smallRNA
-----------|----------------------------------------|---------------------------------------|-------------------------------------
sample1 | path/to/<totalRNA_sample1_R1>.fastq.gz | path/to/<totalRNA_sample1_R2>.fastq.gz| path/to/<smallRNA_sample1>.fastq.gz
sample2 | path/to/<totalRNA_sample2_R1>.fastq.gz | path/to/<totalRNA_sample2_R2>.fastq.gz| path/to/<smallRNA_sample2>.fastq.gz
sample3 | path/to/<totalRNA_sample3_R1>.fastq.gz | path/to/<totalRNA_sample3_R2>.fastq.gz| path/to/<smallRNA_sample3>.fastq.gz
... | ... | ... | ...
A linear transcriptome has to be supplied with:
--transcriptome "path/to/transcriptome"
Other reference files are supported through iGenomes and automatically downloaded (see conf/igenomes.config for details) (https://emea.support.illumina.com/sequencing/sequencing_software/igenome.html). These files can also be given manually by the user with these parameters:
totalRNA files:
--bed12
--fasta
--gtf
smallRNA files:
--miRNA_fasta
--hairpin_fasta
--miRNA_related_fasta
--bowtie_index
--STAR_index
mature and hairpin miRNA sequences for the organism used in the analysis, e.g. mouse and mature miRNA sequences from related species, e.g. rat and human. Instructions forgenerating the miRNA reference files can be found in the miRDeep2 tutorial.
Conserning miRNA adapter trimming, the pipeline will use the default adapter for Illumina sequencing, which can be altered by supplying one of the following parameters.
--miRNA_adapter [adapter_sequence] # miRNA adapter used for trimming
OR
--protocol [sequencing_protocol] # protocol used in smallRNA sequencing, e.g. illumina, cat, etc.
The pipeline offers the option to skip the miRNA quantification step, if this has been done in advance and the raw read counts are available. In this case, the pipeline performs the read mapping and quantification only for circRNAs. It is recommended to use nf-core/smarnaseq pipeline for the quantification of miRNAs. The tabulated raw read counts can be passed directly to the pipeline using the following parameter:
--miRNA_raw_counts [path/to/miRNA_raw_counts.tsv]
The file should be tab-separated and follow the structure shown below. The header should contain the same sample names as defined in the samplesheet. The last column called smallRNA in the samplesheet should have the value ’NA’ among all samples. When running the pipeline in this scenario, the following parameters are no longer required:
--species, --miRNA_adapter, --mature_other_fasta, --hairpin_fasta
miRNA |sample1|sample2| ...
--------------|-------|-------|---------
mmu-let-7a-5p | 53 | 0 | ...
mmu-let-7b-3p | 37 | 93 | ...
... | ... | ... | ...
In case the total RNA sequencing data is single-end, the following additional parameter should be used:
--single_end
default: false
true total RNA sequencing data is single-end
false total RNA sequencing data is paired-end
In addition, the samplesheet should be adapted to match the following structure:
sample | totalRNA1 | smallRNA
-----------|-------------------------------------|------------------------------------
sample1 | path/to/<totalRNA_sample1>.fastq.gz | path/to/<smallRNA_sample1>.fastq.gz
sample2 | path/to/<totalRNA_sample2>.fastq.gz | path/to/<smallRNA_sample2>.fastq.gz
sample3 | path/to/<totalRNA_sample3>.fastq.gz | path/to/<smallRNA_sample3>.fastq.gz
... | ... | ...
psirc requires the input of a linear RNA transcriptome. It is recommended to use an Ensembl transcriptome as it is default for generating kallisto indices (available under https://ftp.ensembl.org/pub/). Linear and circular quantification is enabled by default but can be switched off by:
--quantification false
Database annotation is enabled by default, but can be turned off using:
--database_annotation false
To exclusively analyse annotated circRNAs use the switch:
--annotated_only true
To enable differential expression, set
--differential_expression true
and add condition labels to samplesheet:
sample | totalRNA1 | smallRNA | condition |
-----------|-------------------------------------|------------------------------------ |-----------|
sample1 | path/to/<totalRNA_sample1>.fastq.gz | path/to/<smallRNA_sample1>.fastq.gz | treated |
sample2 | path/to/<totalRNA_sample2>.fastq.gz | path/to/<smallRNA_sample2>.fastq.gz | control |
sample3 | path/to/<totalRNA_sample3>.fastq.gz | path/to/<smallRNA_sample3>.fastq.gz | treated |
... | ... | ... | ... |
SPONGE also requires condition labeling (see Differential expression). To perform SPONGE analysis include:
--sponge true
Additionally one can select the miRNA binding sites tools to use from miranda, TarPmiR and PITA.
Per default all three methods are enabled, but TarPmiR and PITA can be disabled by:
--tarpmir false
--pita false
The majority vote function only works with all three methods enabled. Otherwise, remaining binding sites are simply merged.
SPONGE analysis also requires miRNA-mRNA binding sites for which the pipeline will, per default, use the precomputed data from MirWalk3.0 Human combined with lncBase. Custom linearRNA-miRNA binding sites can be provided using:
--target_scan_symbols path/to/targets.tsv
Format specification:
| miRNA-1 | miRNA-2 | miRNA-3 | ...
-----------|----------|---------|---------|
ENSG..1 | 2 | 3 | 8 |
ENSG..2 | 3 | 0 | 1 |
ENSG..3 | 1 | 3 | 5 |
...
To execute the pipeline, run the following command:
nextflow run circRNAsponging/ -c my.config -profile [singularity/docker],cluster
where my.config is a configuration file specifying parameters and execution settings. An example configuration file is shown below:
params {
samplesheet = "path/to/sampleseet.tsv"
outdir = "path/to/outdir"
species = "mmu"
protocol = "illumina"
transcriptome = "path/to/transcriptome.fa.gz"
}
profiles {
standard {
process.executor = 'local'
}
cluster {
executor.queueSize = 20
process.executor = 'slurm'
process.cpu = '8'
process.memory = '50 GB'
}
}
After normalizing the raw read counts for both circRNAs and miRNAs, the pipeline filters out entries which have a low expression level. By default, only entries having at least 5 reads in at least 20% of samples are used in the downstream analysis. The parameter values can be changed with the options:
--read_threshold
default: 5
real >= 0 read counts under this threshold are considered to be low expressed
--sample_percentage
default: 0.2
0<= real <=1 minimum percentage of samples that should have no low expression
Color palettes from R MetBrewer can be set by:
--palette
TarPmiR prediction cutoff score can be set by:
--p
The output folder is structured as shown below. The circRNA/miRNA results for each sample are stored in the samples folder. The tabulated circRNA and miRNA counts summarized over all samples located in results/circRNA and results/miRNA, respectively. The results and plots of the sponging analysis are stored in the subfolder results/sponging.
├─── output_folder
│ ├─── samples
| │ ├─── sample_1
| | | |─── circRNA_detection
| | | └─── miRNA_detection
| │ ├─── sample_2
| | | |─── circRNA_detection
| | | └─── miRNA_detection
| │ ├─── ...
│ ├─── results
| | |─── circRNA
| | | |─── circRNA_counts_raw.tsv
| | | └─── circRNA_counts_filtered.tsv
| | | └─── circRNA_counts_annotated.tsv
| | |─── psirc
| | | |─── psirc.index
| | | └─── quant_circ_expression.tsv
| | | └─── quant_linear_expression.tsv
| | | └─── TPM_map.tsv
| | | └─── quant_effects.png
| | |─── miRNA
| | | |─── miRNA_counts_raw.tsv
| | | └─── miRNA_counts_normalized.tsv.tsv
| | | └─── miRNA_counts_filtered.tsv
| | |─── differential_expression
| | | └─── totalRNA
| | | | └─── total_rna.tsv
| | | | └─── volcano.png
| | | | └─── nDE.png
| | | | └─── HMAP.png
| | | └─── circRNA_DE
| | | | └─── circRNA_DE.tsv
| | | | └─── volcano.png
| | | | └─── nDE.png
| | | | └─── HMAP.png
| | | └─── DESeq2.RData
| | |─── binding_sites
| | | └─── PITA
| | | └─── TarPmiR
| | | └─── miRanda
| | | └─── majority.tsv.gz
| | |─── sponging
| | | |─── sponging_statistics.txt
| | | |─── filtered_circRNA_miRNA_correlation.tsv
| | | |─── plots
| | | | └─── top1_circRNA_miRNA.png
| | | | └─── ...
| | | └─── SPONGE
| | | | └─── plots
| | | | └─── circRNA
| | | | └─── total
| | | | └─── RData
| | | | └─── spongEffects
└── └── └──
