Permalink
Fetching contributors…
Cannot retrieve contributors at this time
307 lines (214 sloc) 18.5 KB

Enrichment module

The circtools enrichment module was implemented in order to combine circRNA data with positional data from other experiments, for example the eCLIP approach. The setup uses detected circRNA back splice junctions and the corresponding location of circRNAs within the genome to test if positional features (e.g. eCLIP peaks) are significantly enriched within a circRNA compared to the remaining linear host gene.

img/enrich_workflow.png

This may give hints to potential RBP sponge functions of circRNAs when they show significant enrichment of eCLIP peaks in the circRNA portion of the host gene.

Background

Input

The circtools enrich module requires generally three types of input data:

  • circRNA coordinates, e.g. from the circtools package itself or any other BED6-formatted circRNA list.
  • a BED6-formatted file with coordinates of features of interest. E.g. RNA binding protein binding sites or any other sequence-based features that can be condensed into genomic coordinates
  • A genome in FASTA format as well as a genome annotation in GTF format. Circtools works well and was tested with ENSEMBL-based genomes and annotations

How does it work

In a first step the 'observed' distribution of features throughout the supplied circRNAs is calculated. This observed distribution is used as a baseline in the subsequent 'iteration' step. By employing the bedtools shuffle command the features are randomly distributed throughout the genome while keep the number and length of all features constant. After several hundred or thousand randomized iterations circtools counts the number of iterations in which more hits within the defined list of circRNAs are observed than in the initial, actual experimental observation. Circtools than computes the probability that a given number of hits is significantly higher than the simulated random distribution obtained by the random shuffling. The test is carried out for the circRNA and the corresponding host genes, therefore also allowing to distinguish between features enriched in the circRNA and possibly depleted in the circRNA host gene.

Required tools and packages

circtools enrich depends on bedtools as well as R and a some R packages for visualization purposes.

R packages:

  • ggplot2
  • ggrepel
  • data.table
  • reshape2
  • plyr
  • gridExtra

Python libraries:

  • pybedtools >= 0.7.10
  • statsmodels>=0.8.0

Note

The enrichment circtools module as well as all R dependencies are automatically installed during the circtools installation procedure.

General usage

A call to circtools enrich --help shows all available command line flags:

usage: circtools [-h] -c CIRC_RNA_INPUT -b BED_INPUT -a ANNOTATION -g
                 GENOME_FILE [-o OUTPUT_DIRECTORY] [-i NUM_ITERATIONS]
                 [-p NUM_PROCESSES] [-t TMP_DIRECTORY] [-T THRESHOLD]
                 [-P PVAL] [-F OUTPUT_FILENAME] [-I INCLUDE_FEATURES]
                 [-k KEEP_TEMP]

circular RNA RBP enrichment tools

optional arguments:
  -h, --help            show this help message and exit

Required options:
  -c CIRC_RNA_INPUT, --circ-file CIRC_RNA_INPUT
                        Path to the CircRNACount file generated by DCC
  -b BED_INPUT, --bed-input BED_INPUT
                        One or more BED files containing features to overlap
  -a ANNOTATION, --annotation ANNOTATION
                        Genome reference annotation file used to not shuffle
                        into intragenic regions
  -g GENOME_FILE, --genome GENOME_FILE
                        Genome file for use with bedtools shuffle. See
                        bedtools man page for details.

Additional options:
  -o OUTPUT_DIRECTORY, --output OUTPUT_DIRECTORY
                        The output folder for files created by circtest
                        [default: .]
  -i NUM_ITERATIONS, --iterations NUM_ITERATIONS
                        Number of iterations for CLIP shuffling [default:
                        1000]
  -p NUM_PROCESSES, --processes NUM_PROCESSES
                        Number of threads to distribute the work to
  -t TMP_DIRECTORY, --temp TMP_DIRECTORY
                        Temporary directory used by pybedtools
  -T THRESHOLD, --threshold THRESHOLD
                        p-value cutoff
  -P PVAL, --pval PVAL  p-value cutoff
  -F OUTPUT_FILENAME, --output-filename OUTPUT_FILENAME
                        Defines the output file prefix [default: output]
  -I INCLUDE_FEATURES, --include-features INCLUDE_FEATURES
                        Defines the the features which should be used for
                        shuffling. May be specified multiple times. [default:
                        all - shuffle over the whole genome]
  -k KEEP_TEMP, --keep-temp KEEP_TEMP
                        Keep temporary files created by circtools/bedtools
                        [default: no]

Generating necessary input data files

In addition to input data produced by the detection and reconstruct module, the enrichment module requires a few processing steps. For our example we employ the circRNAs detected in the murine heart and are interested in possible enrichment of repeat in the flanking intron of those circRNAs. Therefore, as a first step the flanking introns need to be compiled from the circRNA coordinates provided by the detect module.

circtools_generate_flanking_introns.py -g /scratch/tjakobi/circtools_workflow/genes_and_introns.gtf -d /scratch/tjakobi/circtools_workflow/workflow/circtools/01_detect/CircCoordinates > /scratch/tjakobi/circtools_workflow/murine_flanking_introns.bed

Additionally, the shuffling algorithm of bedtools requires knowledge of chromosome sizes. For the mouse genome, a sample file for those length can be easily downloaded:

wget https://data.dieterichlab.org/s/mm10_chrom_sizes/download -O mm10.chrom.sizes

In order to provide a working example of reasonable size we do not use the full set of repeats as provided by the UCSC genome browser but only the 3 most-common ones, i.e. :

  • AT_rich
  • B3
  • RSINE1
wget https://data.dieterichlab.org/s/repeat_selection_mm10/download -O repeat_selection_mm10.tar.bz2
tar -jxvf repeat_selection_mm10.tar.bz2

After unzipping the downloaded file, the folder repeats/ contains BED files with coordinates of the three aforementioned repeat categories. Those files will be used as input in the next step. The circtools enrich module is able to work with arbitrary features of a GTF annotation file. However, our aim is to search in introns for enrichment and introns are not part of normal ENSEMBL GTF annotation files. circtools includes a script that easily converts ENSEMBL GTF files in GTF files enriched with intron information.

mkdir 06_enrich/
cd 06_enrich/

# download build 90 annotation
wget ftp://ftp.ensembl.org/pub/release-90/gtf/mus_musculus/Mus_musculus.GRCm38.90.gtf.gz

# unzip and add introns
gzip -d Mus_musculus.GRCm38.90.gtf.gz
circtools_generate_intron_gtf.sh Mus_musculus.GRCm38.90.gtf

The resulting files, genes_and_introns.gtf will now serve as replacement for the standard ENSEMBL annotation in the module call.

Calling the reconstruct module via wrapper

cd 06_enrich/

# download wrapper for STAR
wget https://raw.githubusercontent.com/dieterich-lab/bioinfo-scripts/master/slurm_circtools_enrich_intron.sh
chmod 755 slurm_circtools_enrich_intron.sh

parallel -j1 slurm_circtools_enrich_intron.sh /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/mm10.chrom.sizes /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/genes_and_introns.gtf /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/repeats/{}.bed /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/murine_flanking_introns.bed {} /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/output/ 2000 /scratch/global_tmp/{}/ :::: /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/repeats/repeats_selected.list

Manual module call

Below a sample single call of circtools enrich without using the wrapper script:

circtools enrich -c /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/murine_flanking_introns.bed -b /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/repeats/AT_rich.bed -a /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/genes_and_introns.gtf -g /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/mm10.chrom.sizes -i 2000 -I intron -p 20 -P 1 -T 1 -o /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/output// -F AT_rich -t /scratch/global_tmp/AT_rich/

This call to circtools enrich will produce output similar to the one shown below. The run time depends on the size of the circRNA dataset as well as the number of peaks used for the analysis.

Command line output

2018-09-17 11:13:16,166 circtools 1.1.0.6 started
2018-09-17 11:13:16,166 circtools command line: /home/tjakobi//.local/bin/circtools enrich -c /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/murine_flanking_introns.bed -b
 /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/repeats/AT_rich.bed -a /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/genes_and_introns.gtf -g /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/mm10.chrom.sizes -i 2000 -I intron -p 20 -P 1 -T 1 -o /scratch/tjakobi/circtools_workflow/workflow/circtools/06_enrich/output// -F AT_rich -t /scratch/global_tmp/AT_rich//
2018-09-17 11:13:16,177 bedtools v2.27.1 detected
2018-09-17 11:13:16,177 Parsing annotation...
2018-09-17 11:13:17,864 Found 256488 entries
2018-09-17 11:13:17,865 Done parsing annotation
2018-09-17 11:13:20,126 Parsing BED input file...
2018-09-17 11:13:21,207 Done parsing BED input file:
2018-09-17 11:13:21,207 => 228756 peaks, 33 nt average width
2018-09-17 11:13:21,207 Parsing annotation...
2018-09-17 11:13:21,727 Found 52636 entries
2018-09-17 11:13:21,728 Done parsing annotation
2018-09-17 11:13:22,777 Parsing circular RNA input file...
2018-09-17 11:13:22,787 Done parsing circular RNA input file:
2018-09-17 11:13:22,788 => 2522 circular RNAs, 1801 nt average (theoretical unspliced) length
2018-09-17 11:13:23,057 Starting random shuffling of input peaks
2018-09-17 11:13:23,059 Processing shuffling thread 1
2018-09-17 11:13:23,059 Processing shuffling thread 26
[output cut]
2018-09-17 11:35:14,025 Permutation test iteration 1998
2018-09-17 11:35:14,043 Permutation test iteration 1991
2018-09-17 11:35:14,172 Permutation test iteration 2000
2018-09-17 11:35:14,198 Permutation test iteration 1993
2018-09-17 11:35:14,221 Permutation test iteration 1995
2018-09-17 11:35:14,381 Permutation test iteration 1997
2018-09-17 11:35:14,578 Permutation test iteration 1999
2018-09-17 11:35:17,547 Cleaning up... just a second
2018-09-17 11:35:18,740 Cleaning up temporary files
2018-09-17 11:35:20,552 Deleting /scratch/global_tmp/AT_rich/pybedtools.knohds5y.tmp
2018-09-17 11:35:20,553 Deleting /scratch/global_tmp/AT_rich/pybedtools.j0mwk09_.tmp
2018-09-17 11:35:20,553 Deleting /scratch/global_tmp/AT_rich/pybedtools.85vjrbnw.tmp
2018-09-17 11:35:20,553 Deleting /scratch/global_tmp/AT_rich/pybedtools.jli7p0je.tmp
2018-09-17 11:35:20,553 Deleting /scratch/global_tmp/AT_rich/pybedtools.4vz84ujl.tmp
2018-09-17 11:35:20,560 Done

Output produced by circtools enrich

*.csv

The generated CSV file is the main output of circtools enrich. It contains the data generated during the run and has the following fields:

  • circRNA_host_gene: Name of the circRNA host gene
  • chr: Chromosome location of the circRNA
  • start: Absolute circRNA start location
  • stop: Absolute circRNA end location
  • strand: Strand of the circRNA
  • p-val_circular: p-value for the enrichment of peaks within the given circRNA
  • raw_count_circ_rna: How many simulated peaks have been counted
  • observed_input_peaks_circ_rna: How many real, experimental peaks have been observed
  • length_circ_rna: Length of the circRNA
  • length_normalized_count_circ_rna: Lengt-normalized count of observed peaks
  • number_of_features_intersecting_circ: How many featured are interesecting the circRNA (only -i)
  • circ_rna_confidence_interval_0.05: 0.05% confidence interval for the circRNA test
  • p-val_linear: p-value for the enrichment of peaks within the linear host gene excluding the circRNA portion
  • raw_count_host_gene: How many simulated peaks have been counted
  • observed_input_peaks_host_gene: How many real, experimental peaks have been observed
  • length_host_gene_without_circ_rna: Length of the host gene minus the circRNA length
  • length_normalized_count_host_gene: Lengt-normalized count of observed peaks
  • number_of_features_intersecting_linear: How many featured are interesecting the host gene (only -i)
  • host_gene_confidence_interval_0.05: 0.05% confidence interval for the linear test
  • distance_normalized_counts: Distance between the length-normalized counts of linear host gene and circRNA
circRNA_host_gene chr start stop strand p-val_circular raw_count_circ_rna observed_input_peaks_circ_rna length_circ_rna length_normalized_count_circ_rna number_of_features_intersecting_circ circ_rna_confidence_interval_0.05 p-val_linear raw_count_host_gene observed_input_peaks_host_gene length_host_gene_without_circ_rna length_normalized_count_host_gene number_of_features_intersecting_linear host_gene_confidence_interval_0.05 distance_normalized_counts
RERE 1 8539213 8541213
0.0005 1 1 2000 0.5 1 (1.265882386853128e-05, 0.0027826398346596504) 0 0 59 450423 0 37 (nan, 0.0018427397934069074) -0.5

*.bed

The generated BED files are holding the temporary annotation data created by circtools enrich.

  • .gtf_features.bed: Contains one row for each feature extract from the supplied genome annotatation (only in if -i is used)
  • .gtf_genes.bed: Conatins one for for each gene entry from the supplied genome annotation
  • .bed_circles.bed: Contains all supplied CircRNAs in BED format

*.log

The log file generated by circtools enrich.

Additional graphical visualization

Circtools is bundled with an additional R-script to post-process the raw data of the enrichment module. In order to be used with the visualization script, the enrich data has to be slightly preprocessed. The visualization script is designed to work with multiple sets of peaks, i.e. multiple different eCLIP data sets.

Pre-processing

In case of the example used throughout this tutorial, the experiment has been performed with multiple repeat classes and the following command may be used to merge all runs into one CSV file with a new first column that contains the file name of the originating run (assuming 2000 iterations):

awk -F '\t' '{{print FILENAME"\t"$0}}' *.csv | sed 's/_.*2000_.*.csv//g' | grep -v circRNA_host_gene > ../results.csv

Subsequently, the transformed data file may be used for visualization:

Plotting

circtools_enrich_visualization.R results.csv 0.05 10 10 results.pdf "Results" colour False

Visualized results

Visualization of the results generated by the enrichment module for the Jakobi 2016 data set. The is sample experiment looks at enriched repeats in the flanking introns of the detect circRNAs.

1) Top 3 repeat categories with enrichment within the flanking introns (max. +/- 2kb) of the significantly enriched circRNA candidates

img/results-0.png

2) Number of circular RNAs per repeat category (p<0.05)

img/results-1.png

3) Detailed view of the repeat "peaks" enriched in the flanking introns of isoform 1 of circRyr2

img/results-2.png