A pipeline to quantify SL trans-splicing events from RNA-seq data
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
data
.gitignore
LICENSE
README.md
SL-quant.sh
SL_cutadapt.sh
SL_sites.sh
_config.yml
analyse_SL.R
map_reads.sh
map_reads_modENCODE.sh
set-up.sh

README.md

SL-quant

A pipeline to quantify SL trans-splicing events from RNA-seq data

SL-quant is a bash pipeline that adapts to paired-end and single-end RNA-seq data to accurately quantify splice-leader (SL) trans-splicing events by genes in the nematode C. elegans. It is designed to work downstream of read mapping and takes the reads left unmapped as primary input. SL-quant completes under 15 minutes on a basic desktop computer for typical RNA-seq libraries.

Detailed description and validation of the pipeline are reported in the manuscript published in Gigascience.

For support, questions or requests, please contact: carlo.yaguesanz@unamur.be

Installation & quick start

SL-quant comes as a simple bash script that works on macOS and Linux systems. However, the following dependencies need to be installed and set in your PATH:

There are many ways to install these but the easiest might be through a package manager such as brew.

Easy dependencies installation with homebrew

Install homebrew on macOS

Follow instructions on the brew homepage or simply paste this command in the Terminal:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install linuxbrew on linux systems

Follow instructions on the linuxbrew homepage.

Install dependencies with brew (macOS and linux)

Paste the following commands in the Terminal:

brew update
brew install blast
brew install samtools
brew install picard-tools
brew install bedtools
brew install hisat2
brew install cutadapt
brew tap jrderuiter/tap
brew install subread

Setting-up and testing SL-quant

Clone SL-quant

To install SL-quant, clone this repository using git:

git clone https://github.com/cyaguesa/SL-quant

or alternatively, use the green downoad button at the top of the page.

Set-up SL-quant

To complete SL-quant set-up, use this script to create directories, make the other scripts executable, download the C. elegans genome sequence and build the indexes. This can take a few minutes.

cd SL-quant  # or 'cd ~/Desktop/SL-quant' if it is cloned on your desktop
chmod +x set-up.sh
./set-up.sh
Test SL-quant

We provide a small test dataset (1000 reads) for testing SL-quant in a few seconds.

time ./SL-quant.sh data/reads/test_unmapped.bam SL-quant_results/test
real	0m2.718s
user	0m3.235s
sys	0m0.589s
Basic usage
./SL-quant.sh -h

SL-quant identifies trans-splicing events as unmapped reads containing a splice leader (SL) sequence at the 5’ end of the read. In paired-end mode, the unmapped reads are pre-filtered based on the status of their mate (it must be mapped). In sensitive mode, the criteria to identify SL sequences are less stringent, increasing sensitivity.

Detailed instructions on github.com/cyaguesa/SL-quant

USAGE: ./SL-quant.sh [-p -m mapped.bam] [-s] unmapped.bam output_base

Required arguments:
  unmapped.bam
			file containing unmapped reads.
  output_base
			base name (+path) for the ouput.

Optional arguments:
  --mapped mapped.bam, -m mapped.bam
			file containing mapped reads.
  --paired, -p
			run SL-quant in paired-end mode. 
			Requires -m argument.
  --sensitive, -s
			run SL-quant in sensitive mode.
  --help, -h
			show this help message and exit.

Detailed usage

A word about the SL-quant input

To minimize SL-quant running time, its primary input is limited to reads susceptible to originate from trans-spliced RNA fragments, that is, unmapped reads in bam format. This implies that a first round of mapping to the C. elegans genome or transcriptome must precede the use of SL-quant. It must be performed end-to-end (without soft-clipping) in order to make sure that reads originating from trans-spliced RNA fragments do not map. This is the default behaviour of many mappers (bowtie2, tophat2, BBMap, …) but for others, such as STAR or HiSAT2, soft-clipping should be disabled. Beside this specification, any coordinate-sorted bam file containing unmapped reads can be fed into SL-quant, making it particularly well suited for posterior analysis of old data.

Basic paramaters

paired-end mode (-p --paired)

In the case paired-end data is available, we provide an optimized paired-end mode that requires an additional parameter -m --mapped, referring to the mapped reads of the library. These mapped reads will be used to prefilter the unmapped reads and to refine the mapping after the SL-sequence trimming.

Example: ./SL-quant.sh -p -m data/reads/test_mapped.bam data/reads/test_unmapped.bam SL-quant_results/test

sensitive mode (-s --sensitive)

In sensitive mode, the criteria to identify SL sequences are less stringent, increasing sensitivity (at the cost of some specificity). It can be used in combinaison with the paired-end mode or in single-end mode.

Example 1 : ./SL-quant.sh -s data/reads/test_unmapped.bam SL-quant_results/test

Example 2 : ./SL-quant.sh -p -s -m data/reads/test_mapped.bam data/reads/test_unmapped.bam SL-quant_results/test

Advanced paramaters

At the beginning of the SL-quant.sh file, there are a few additional parameters that can be modified for advanced users.

  • set -e This means that the script will stop at the first error (which is usually for the best).
  • SL_db="data/blast_db/SL.fasta" The path to SL sequence database for blast.
  • gene_annotation="data/genes.gtf"The annotation file for the summarization step.
  • index="data/ce10_hisat2_index/genome"The genome index for hisat2.
  • paired_orientation="FR"The read orientation for paired-end mode (ignored in single-end mode). Value={"FR" (default), "RF", "unstranded"}
  • single_orientation="R"The read orientation for single-end mode (ignored in paired-end mode). Value={"F" (stranded), "R" (reversely stranded), "unstranded"}
  • threads=4 The number of threads to use.
  • send=22 The minimal end of alignment in 'subject' threshold for blast (this should be equal to the length of the SL sequence).
  • align_length=5 The minimal length of the cutadapt alignment (default = 5).

Output files

Counts results

The final output file named [output_dir/base]_counts.txt is a tab delimited file generated by featureCounts containing the number of SL1 and SL2 trans-splicing events by genes. A summary of the quantification is available in the [output_dir/base]_counts.txt.summary file.

Intermediate files

Those files are generated during the SL-quant process. When the file name ends by SL1, there is a similar file ending with SL2 for the SL2-containing reads.

  • [output_dir/base]_blasted.txt countains the raw results of the blast of the unmapped reads to the SL sequences.
  • [output_dir/base]_blasted_SL1.txt countains blast results for the SL1-containing reads.
  • [output_dir/base]_SL1_IDs.txt countains the read IDs of the SL1-containing reads.
  • [output_dir/base]_SL1.sam countains the SL1-containing reads in SAM format (single-end mode only).
  • [output_dir/base]_SL1_trimmed.fq countains the trimmed SL1-containing reads in fastq format (single-end mode only).
  • [output_dir/base]_SL1_remapped.bam countains the trimmed SL1-containing reads mapped on the genome in bam format.
  • [output_dir/base]_oneEnd_unmapped.fasta countains unmapped reads (after prefiltering) in fasta format (paired-end mode only).
  • [output_dir/base]_oneEndMapped.bam countains the mates of the unmapped reads (after prefiltering) in bam format (paired-end mode only).
log file

[output_dir/base]_log.txt countains various warning/outputs generated during the SL-quant process.

Adaptation to other species

While SL-quant was developed for and tested on C.elegans data, many other species do SL-trans-splicing. The analysis of trans-splicing events in such species is possible with SL-quant with the following adaptations. Don't hesitate to contact us if you need any help implementing those changes.

Change the SL sequences in the blast database

1- Find the SL sequences (not the full SL RNA sequences, only the part that will be trans-spliced to the mRNA) and save it in a fasta file SL_my_species.fasta in the data/blast_db directory. You should include the characters "SL1" and/or "SL2" in the header of the fasta sequence.

2- Build the new blast database:

makeblastdb -dbtype nucl -in data/blast_db/SL_my_species

3- Replace the value of the SL_db parameter in the SL-quant.sh script by "data/blast_db/SL_my_species.fasta".

4- Replace the value of the send parameter in the SL-quant.sh script by the length of the new SL sequence(s).

Change the reference genome index.

1- Download the reference genome for your species of interest and save it as a fasta file genome_my_species.fa in a new data/index_my_species directory.

2- Build the hisat2 index:

hisat2-build data/index_my_species/genome_my_species.fa data/index_my_species/genome_my_species

3- Replace the value of the index parameter in the SL-quant.sh script by "data/index_my_species/genome_my_species.fa".

Change the gene annotation file.

1 - Download or create a .gtf file genes_my_species.gtf for your species and save it into the data directory.

2- Replace the value of the gene_annotation parameter in the SL-quant.sh script by data/genes_my_species.gtf.

3- Done ! Now you can use SL-quant with your new species !

Identification of SL trans-spliced sites

We designed SL-quant with the idea of quantifying SL trans-splicing events by genes but it is also possible to identify trans-spliced sites at single nucleotide resolution from the output. Indeed, in single-end mode, the 5' end of the reads mapped after SL sequence trimming correspond to the position of the trans-spliced sites. The folowing lines describe such analysis applied to the SL1 trans-splicing only.

sort remapped bam files

samtools sort SL-quant_results/test_SL1_remapped.bam -o SL-quant_results/test_SL1_remapped_sorted.bam

get 5' end positions of reads (strand-specific)

bedtools genomecov -ibam SL-quant_results/test_SL1_remapped_sorted.bam -dz -5 -strand + > SL-quant_results/test_TS_SL1_sites_fwd.tab
bedtools genomecov -ibam SL-quant_results/test_SL1_remapped_sorted.bam -dz -5 -strand - > SL-quant_results/test_TS_SL1_sites_fwd.tab

see transpliced site

head -n 1 SL-quant_results/test_TS_SL1_sites_fwd.tab
chrI	6789739     1   # 1 trans-splicing event at position 6789739 of strand '+' of chrI

Consensus

We also provide a script, SL_sites.sh, which automatize the analysis of the SL trans-spliced sites. In addition, it generates a fasta file of the -5 -> +5 interval around this splice that can be used to create a consensus sequence (with weblogo for instance). The proportion of sites on 'AG' consensus acceptor sequences is also outputed.

./SL_sites.sh SL-quant_results/test_SL1_remapped.bam
find number of SL-transplicing events by sites (not by genes): SL-quant_results/test_SL1_remapped.bam
[single-end data]
      number of SL-transplicing events detected: 49
      number of sites detected:       12
      number of sites which are 'AG' splice-sites: 11 (91.666 %)
  
head -n 2 SL-quant_results/test_SL1_remapped.bam.fasta
>I:11611-11622(+)
TTACAGTAAGC 

Reproduce the analysis from the manuscript.

To reproduce the full analysis presented in our manuscript from the raw data, R and trimmomatic should be installed. Trimmomatic can be installed with brew:

brew tap brewsci/bio
brew install trimmomatic

Get and map data

We provide two bash scripts to map the reads for the paired-end (map_reads.sh) and the single-end dataset (./map_reads_modENCODE.sh).

single-end dataset (modENCODE_4594)
mkdir data/reads/modENCODE
curl ftp://data.modencode.org/all_files/cele-raw-1/4594_SRR125481.fastq.gz -o data/reads/modENCODE/modENCODE_4594.fastq.gz
./map_reads_modENCODE.sh data/reads/modENCODE/modENCODE_4594.fastq.gz
paired-end dataset (SRR1585277)
mkdir data/reads/SRR1585277
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR158/007/SRR1585277/SRR1585277_1.fastq.gz -o data/reads/SRR1585277/SRR1585277_1.fastq.gz
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR158/007/SRR1585277/SRR1585277_2.fastq.gz -o data/reads/SRR1585277/SRR1585277_2.fastq.gz
./map_reads.sh
generate random reads (fasta -> fastq -> uBAM)
bedtools random -l 50 -seed 0 -n 1000000 -g data/chrom_summary.txt > data/reads/random.bed
bedtools getfasta -fi data/ce10_hisat2_index/genome.fa -bed data/reads/random.bed > data/reads/random.fa
awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "J"; printf "\n"}' data/reads/random.fa > data/reads/random.fq
picard FastqToSam F1=data/reads/random.fq O=data/reads/random.bam SM=random

Run SL-quant

single-end dataset (modENCODE_4594)

In SL-quant single-end mode, normal and -s --specific settings. Also using the 'cutadapt' method.

./SL-quant.sh data/reads/modENCODE/modENCODE_4594/unmapped.bam SL-quant_results/modENCODE_4594
./SL-quant.sh -s data/reads/modENCODE/modENCODE_4594/unmapped.bam SL-quant_results/modENCODE_4594_s/
./SL_cutadapt.sh data/reads/modENCODE/modENCODE_4594/unmapped.bam SL-quant_results/modENCODE_4594_cutadapt/
paired-end dataset (SRR1585277)

In single and paired-end, normal and -s --specific settings. Also using the 'cutadapt' method.

./SL-quant.sh data/reads/SRR1585277/SRR1585277/unmapped.bam SL-quant_results/SRR1585277_single/
./SL-quant.sh -s data/reads/SRR1585277/SRR1585277/unmapped.bam SL-quant_results/SRR1585277_single_s/
./SL-quant.sh -p -m data/reads/SRR1585277/SRR1585277/accepted_hits_sorted.bam data/reads/SRR1585277/SRR1585277/unmapped.bam SL-quant_results/SRR1585277_paired/
./SL-quant.sh -s -p -m data/reads/SRR1585277/SRR1585277/accepted_hits_sorted.bam data/reads/SRR1585277/SRR1585277/unmapped.bam SL-quant_results/SRR1585277_paired_s/
./SL_cutadapt.sh data/reads/SRR1585277/SRR1585277/unmapped.bam_R2.bam SL-quant_results/SRR1585277_single_cutadapt/
random reads dataset

In SL-quant single-end mode, normal and -s --specific settings. Also using the 'cutadapt' method.

./SL-quant.sh data/reads/random.bam SL-quant_results/random/
./SL-quant.sh -s data/reads/random.bam SL-quant_results/random_s/
./SL_cutadapt.sh data/reads/random.bam SL-quant_results/random_cutadapt/

Analyse the data and make figures

An R script (analyse_SL.R) is provided to reproduce the analysis and figures presented in the manuscript.

In addition, another shell script is provided to generate statistics on the SL trans-spliced sites found with the various methods. First, as we want general information, we merge the remapped SL1 and SL2 bam files:

for i in $(ls SL-quant_results/*/*_log.txt | rev | cut -c 8- | rev | uniq)
  do
  samtools merge -f ${i}SL_merged_remapped.bam ${i}SL1_remapped.bam ${i}SL2_remapped.bam
  done

Then SL_sites.sh is run on the merged bam files. Alternatively, it can also be used directly on the SL1/SL2 specific bam files.

./SL_sites.sh SL-quant_results/*/*SL_merged_remapped.bam