GitHub - dieterich-lab/FUCHS: FUCHS - FUll circle CHaracterization from rna-Seq

FUCHS - FUll circular RNA CHaracterization from RNA-Seq

FUCHS is a Python pipeline designed to fully characterize circular RNAs. As input it takes a list of circular RNAs together with the reads spanning each back-splice junction, and a BAM file containing the mapping of all reads (alternatively, of all chimeric reads).

The reads from one circle are extracted by FUCHS and saved in an individual BAM file. Based on these BAM files, FUCHS will detect alternative splicing within the same circle boundaries, summarize different circular isoforms from the same host gene, and generate coverage plots for each circle. It will also cluster circles based on their coverage profile. These results can be used to identify potential false positive circles.

Installation

FUCHS depends on bedtools (>= 2.25.0), samtools (>= 1.3.1), Python (> 2.7; pysam>=0.9.1.4, pybedtools>=0.7.8, numpy>=1.11.2, pathos>=0.2.1) and R (> 3.2.0; amap, Hmisc, gplots). All Python and R dependencies will be installed automatically when installing FUCHS. Please make sure to have the correct versions of bedtools and samtools in your $PATH.

Clone the repository and install FUCHS using setup.py:

$ git clone git@github.com:dieterich-lab/FUCHS.git

$ cd FUCHS

$ python setup.py install --user

# This will install a FUCHS binary in $HOME/.local/bin/
# make sure this folder is in your $PATH

# Check the installation:

$ FUCHS --help
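
Since bedtools and samtools are not installed automatically, it can help to confirm that both are reachable before running FUCHS. The following is a minimal sketch, assuming Python 3; it is not part of FUCHS, the function name find_tools is hypothetical, and it only checks that the executables are on $PATH, not their versions.

```python
import shutil


def find_tools(names=("bedtools", "samtools")):
    """Return a dict mapping each tool name to its resolved path, or None if it is not on $PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        # A missing tool is reported rather than raising, so all tools are checked in one run.
        print("{}: {}".format(tool, path if path else "NOT FOUND in $PATH"))
```

The version requirements listed above still have to be checked by hand, for example with bedtools --version and samtools --version.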

Usage

To characterize circRNAs from RNA-seq data you have to:

Map RNA-seq data from quality-checked FASTQ files with either STAR, BWA, or TopHat-Fusion.
Detect circRNAs using DCC, CIRI, CIRCfinder or CIRCexplorer, depending on the program you used for mapping.
Run FUCHS (right now only the combination STAR + DCC has been tested; other setups are under development).

Step by step tutorial

In this tutorial we will use the HEK293 data available in this repository and run STAR with DCC to detect circular RNAs.

1. Mapping of RNA-Seq data

Map RNA-seq data with STAR (Dobin et al., 2013). Note that --alignSJoverhangMin and --chimJunctionOverhangMin should use the same value, to make the circRNA expression and linear gene expression levels comparable. Note that STARlong does not map chimeric reads correctly.

Note: The joined pair mapping should be run first. If the data are paired end, two additional separate mate mappings are recommended. This step is not mandatory, but it will increase the sensitivity of DCC detection, because it collects small circRNAs which appear with one chimeric junction point at each read mate. If the data are single end, only one mapping step is needed. In this case, PE sequencing data was used.

$ STAR --readFilesCommand zcat --runThreadN 18 \
       --genomeDir [genome] \
       --outSAMtype BAM SortedByCoordinate \
       --readFilesIn [sample]_1.fastq.gz ([sample]_2.fastq.gz) \
       --outFileNamePrefix [sample] \
       --quantMode GeneCounts \
       --genomeLoad NoSharedMemory \
       --outReadsUnmapped Fastx \
       --outSJfilterOverhangMin 15 15 15 15 \
       --alignSJoverhangMin 15 \
       --alignSJDBoverhangMin 10 \
       --outFilterMultimapNmax 20 \
       --outFilterScoreMin 1 \
       --outFilterMismatchNmax 999 \
       --outFilterMismatchNoverLmax 0.05 \
       --outFilterMatchNminOverLread 0.7 \
       --alignIntronMin 20 \
       --alignIntronMax 1000000 \
       --alignMatesGapMax 1000000 \
       --chimSegmentMin 15 \
       --chimScoreMin 15 \
       --chimScoreSeparation 10 \
       --chimJunctionOverhangMin 15 \
       --twopassMode Basic \
       --alignSoftClipAtReferenceEnds No \
       --outSAMattributes NH HI AS nM NM MD jM jI XS \
       --sjdbGTFfile [annotation].gtf
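
Because --alignSJoverhangMin and --chimJunctionOverhangMin should carry the same value, a quick check of the assembled command line can catch a mismatch before a long mapping run. The following is a minimal sketch, assuming Python 3; the helper names option_value and overhangs_match are hypothetical and not part of STAR or FUCHS, and the example command is shortened for illustration.

```python
import shlex


def option_value(command, option):
    """Return the first value following `option` in a command string, or None if absent."""
    tokens = shlex.split(command)
    if option in tokens:
        idx = tokens.index(option)
        if idx + 1 < len(tokens):
            return tokens[idx + 1]
    return None


def overhangs_match(command):
    """True if --alignSJoverhangMin and --chimJunctionOverhangMin are both set and equal."""
    align = option_value(command, "--alignSJoverhangMin")
    chim = option_value(command, "--chimJunctionOverhangMin")
    return align is not None and align == chim


# Shortened, illustrative command line (not a complete STAR call).
cmd = "STAR --alignSJoverhangMin 15 --chimSegmentMin 15 --chimJunctionOverhangMin 15"
print(overhangs_match(cmd))
```

Options are matched as whole tokens, so --alignSJDBoverhangMin is not confused with --alignSJoverhangMin.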

1.1. Mates separate mapping (optional for PE data)

Note: the mate assignments should be consistent throughout the mapping and circular RNA detection process. In the following case, SamplePairedRead_1.fastq.gz is the first mate which also was the first mate in the STAR call.

# remap unmapped reads as single end to obtain double breakpoint fragments

$ gzip sample/Unmapped.out.mate1
$ mv sample/Unmapped.out.mate1.gz sample/Unmapped_out_mate1.fastq.gz
$ STAR --readFilesCommand zcat --runThreadN 18 --genomeDir [genome] --outSAMtype BAM SortedByCoordinate --readFilesIn [sample]/Unmapped_out_mate1.fastq.gz --outFileNamePrefix [sample].mate1.  --quantMode GeneCounts --genomeLoad NoSharedMemory --outReadsUnmapped Fastx --outSJfilterOverhangMin 15 15 15 15 --alignSJoverhangMin 15 --alignSJDBoverhangMin 10 --outFilterMultimapNmax 20 --outFilterScoreMin 1   --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.05 --outFilterMatchNminOverLread 0.7 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000  --chimSegmentMin 15  --chimScoreMin 15   --chimScoreSeparation 10  --chimJunctionOverhangMin 15 --twopassMode Basic --alignSoftClipAtReferenceEnds No --outSAMattributes NH HI AS nM NM MD jM jI XS  --sjdbGTFfile [annotation].gtf

$ gzip sample/Unmapped.out.mate2
$ mv sample/Unmapped.out.mate2.gz sample/Unmapped_out_mate2.fastq.gz
$ STAR --readFilesCommand zcat --runThreadN 18 --genomeDir [genome] --outSAMtype BAM SortedByCoordinate --readFilesIn [sample]/Unmapped_out_mate2.fastq.gz --outFileNamePrefix [sample].mate2.  --quantMode GeneCounts --genomeLoad NoSharedMemory --outReadsUnmapped Fastx --outSJfilterOverhangMin 15 15 15 15 --alignSJoverhangMin 15 --alignSJDBoverhangMin 10 --outFilterMultimapNmax 20 --outFilterScoreMin 1   --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.05 --outFilterMatchNminOverLread 0.7 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000  --chimSegmentMin 15  --chimScoreMin 15   --chimScoreSeparation 10  --chimJunctionOverhangMin 15 --twopassMode Basic --alignSoftClipAtReferenceEnds No --outSAMattributes NH HI AS nM NM MD jM jI XS  --sjdbGTFfile [annotation].gtf

2. Detection of circRNAs from chimeric.out.junction files with DCC

Acquiring suitable GTF files for repeat masking

It is strongly recommended to specify a repetitive region file in GTF format for filtering.
A suitable file can, for example, be obtained through the UCSC table browser. After choosing the genome, a group like Repeats or Variation and Repeats has to be selected. For the track, we recommend choosing RepeatMasker together with Simple Repeats and combining the results afterwards.
Note: the output file needs to comply with the GTF format specification. Additionally, chromosome names may differ between databases, e.g. 1 for chromosome 1 in ENSEMBL compared to chr1 in UCSC. Since the chromosome names are important for the correct functioning of DCC, the identifiers may need to be converted; a sample command is sed -i 's/^chr//g' your_repeat_file.gtf
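
The same conversion as the sed command can be done in Python when the file is processed in a script anyway. This is a minimal sketch, assuming Python 3; the function name strip_chr_prefix is hypothetical and the example line is made up for illustration. Like the sed command, it only removes a leading chr and does not handle other naming differences (for instance, the mitochondrial chromosome is chrM in UCSC but MT in ENSEMBL).

```python
def strip_chr_prefix(lines):
    """Yield each line with a leading 'chr' removed, mirroring sed 's/^chr//'."""
    for line in lines:
        yield line[3:] if line.startswith("chr") else line


# Made-up GTF-like lines for illustration only.
example = [
    "chr1\trmsk\texon\t100\t200\t.\t+\t.\tgene_id \"example\";\n",
    "2\trmsk\texon\t300\t400\t.\t-\t.\tgene_id \"example2\";\n",
]
for converted in strip_chr_prefix(example):
    print(converted, end="")
```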

Preparation of files containing the paths to required `chimeric.out.junction` files

samplesheet file, containing the paths to the jointly mapped chimeric.out.junction files

$ cat samplesheet
/path/to/STAR/sample/joint_mapping/chimeric.out.junction

mate1 file, containing the paths to chimeric.out.junction files of the separately mapped first read of paired-end data

$ cat mate1
/path/to/STAR/sample.mate1/joint_mapping/chimeric.out.junction

mate2 file, containing the paths to chimeric.out.junction files of the separately mapped second read of paired-end data

$ cat mate2
/path/to/STAR/sample.mate2/joint_mapping/chimeric.out.junction

Running DCC

After performing all preparation steps DCC can now be started:

# Call DCC to detect circRNAs, using HEK293 data as example.

$ DCC @samplesheet \ # @ is generally used to specify a file name
      -mt1 @mate1 \ # mate1 file containing the mate1 independently mapped chimeric.out.junction files
      -mt2 @mate2 \ # mate2 file containing the mate2 independently mapped chimeric.out.junction files
      -D \ # run in circular RNA detection mode
      -R [Repeats].gtf \ # regions in this GTF file are masked from circular RNA detection
      -an [Annotation].gtf \ # annotation is used to assign gene names to known transcripts
      -Pi \ # run in paired independent mode, i.e. use -mt1 and -mt2
      -F \ # filter the circular RNA candidate regions
      -M \ # filter out candidates from mitochondrial chromosomes
      -Nr 2 2 \ # minimum number of replicates the candidate is showing in [1] and minimum count in the replicate [2]
      -fg \ # candidates are not allowed to span more than one gene
      -G \ # also run host gene expression
      -A [Reference].fa \ # name of the fasta genome reference file; must be indexed, i.e. a .fai file must be present

# For details on the parameters please refer to the help page of DCC:
$ DCC -h

Notes:

By default, DCC assumes that the data is stranded. For non-stranded data the -N flag should be used.
Although not mandatory, we strongly recommend using the -F filtering step.

Output files generated by DCC

The output of DCC consists of the following four files: CircRNACount, CircCoordinates, LinearCount and CircSkipJunctions.

CircRNACount: a table containing read counts for the detected circRNAs. The first three columns are chr, circRNA start, circRNA end. From the fourth column on are the circRNA read counts, one sample per column, in the order given in your samplesheet.
CircCoordinates: circular RNA annotations in BED format. The columns are chr, start, end, genename, junctiontype (based on STAR; 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT), strand, circRNA region (startregion-endregion), overall regions (the genomic features the circRNA coordinate interval covers).
LinearCount: host gene expression count table, with the same layout as the CircRNACount file.
CircSkipJunctions: circSkip junctions. The first three columns are the same as in LinearCount/CircRNACount; the following columns represent the circSkip junctions found for each sample. circSkip junctions are given as chr:start-end:count, e.g. chr1:1787-6949:10. Multiple circSkip junctions may be found for one circRNA, because the circular RNA may arise from different isoforms. In this case, the circSkip junctions are delimited with semicolons. A 0 implies that no circSkip junctions have been found for this circRNA.
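
The CircSkipJunctions cell format described above (chr:start-end:count, semicolon-delimited, 0 for none) can be unpacked with a few lines of Python. This is a minimal sketch, assuming Python 3; the function name parse_circskip is hypothetical and not part of DCC, and the second junction in the usage example is made up.

```python
def parse_circskip(field):
    """Parse one CircSkipJunctions cell into a list of (chrom, start, end, count) tuples."""
    field = field.strip()
    if field == "0":
        # A bare 0 means no circSkip junction was found for this circRNA and sample.
        return []
    junctions = []
    for entry in field.split(";"):
        if not entry:
            continue  # tolerate a trailing semicolon
        # Split from the right so chromosome names containing ':' stay intact.
        chrom, span, count = entry.rsplit(":", 2)
        start, end = span.split("-")
        junctions.append((chrom, int(start), int(end), int(count)))
    return junctions


print(parse_circskip("chr1:1787-6949:10"))
# Hypothetical cell with two junctions for the same circRNA.
print(parse_circskip("chr1:1787-6949:10;chr1:1787-7100:3"))
```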

3. Prepare input data for FUCHS

The files chimeric.sam, mate1.chimeric.sam, and mate2.chimeric.sam have to be merged for FUCHS (not necessary if circles were detected using BWA/CIRI).

# convert SAM to BAM
$ samtools view -Sb -o sample.bam sample/Chimeric.out.sam
$ samtools view -Sb -o sample.1.bam sample.1/Chimeric.out.sam
$ samtools view -Sb -o sample.2.bam sample.2/Chimeric.out.sam

# sort all three BAM files
$ samtools sort -o sample.sorted.bam sample.bam
$ samtools sort -o sample.1.sorted.bam sample.1.bam
$ samtools sort -o sample.2.sorted.bam sample.2.bam

# create an index for each sorted BAM file
$ samtools index sample.sorted.bam
$ samtools index sample.1.sorted.bam
$ samtools index sample.2.sorted.bam

# merge the joint and both mate BAM files into one new BAM file
$ samtools merge merged_sample.bam sample.sorted.bam sample.1.sorted.bam sample.2.sorted.bam

# index the newly aggregated BAM file
$ samtools index merged_sample.bam

4. Running FUCHS

Run FUCHS to start the pipeline, which will extract reads, check mate status, detect alternative splicing events, classify different isoforms, generate coverage profiles and cluster circRNAs based on coverage profiles.

# using STAR/DCC Input
$ FUCHS -r 2 -q 2 -p ensembl -e 2 -T ~/tmp \
        -D CircRNACount \
        -J sample/Chimeric.out.junction \
        -F sample.1/Chimeric.out.junction \
        -R sample.2/Chimeric.out.junction.fixed \
        -B merged_sample.bam \
        -A [annotation].bed \
        -N sample

# if BWA/CIRI was used, use -C to specify the circIDs list (omit -D, -J, -F and -R)
# For details on the parameters please refer to the help page of FUCHS:
$ FUCHS --help

5. Optional FUCHS modules

Run the additional module guided_denovo_circle_structure_parallel.py to obtain a more refined circle reconstruction based on intron signals. The circRNA-separated BAM files (step 2) are the only input needed for the module. If you supply an annotation file, unsupported exons will be reported with a score of 0; if you do not supply an annotation file, unsupported exons will not be reported.

$ guided_denovo_circle_structure_parallel -c 18 -A [annotation].bed -I FUCHS/output/folder -N sample

# FUCHS/output/folder corresponds to the output directory of the FUCHS pipeline
# sample corresponds to your sample name, just as specified for the pipeline

That's all folks

Required input data

circIDs:

circID	read1,read2,read3
1:3740233|3746181	MISEQ:136:000000000-ACBC6:1:2107:10994:20458,MISEQ:136:000000000-ACBC6:1:1116:13529:8356
1:8495063|8614686	MISEQ:136:000000000-ACBC6:1:2118:9328:9926

The first column contains the circle ID, formatted as follows: chr:start|end. The second column is a comma-separated list of the names of reads spanning the back-splice junction.
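
To illustrate the circIDs layout, the following minimal sketch (assuming Python 3) reads such lines into a dictionary. The function name parse_circ_ids is hypothetical and not part of FUCHS, and treating a first line that starts with circID as a header to skip is an assumption based on the example above.

```python
def parse_circ_ids(lines):
    """Map (chrom, start, end) to the list of junction-spanning read names."""
    circles = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("circID"):
            continue  # assumption: a line starting with 'circID' is a header
        circ_id, reads = line.split("\t")
        # Circle IDs look like chr:start|end; split from the right on ':'.
        chrom, coords = circ_id.rsplit(":", 1)
        start, end = coords.split("|")
        circles[(chrom, int(start), int(end))] = reads.split(",")
    return circles


example = [
    "circID\tread1,read2,read3\n",
    "1:8495063|8614686\tMISEQ:136:000000000-ACBC6:1:2118:9328:9926\n",
]
for key, read_names in parse_circ_ids(example).items():
    print(key, len(read_names))
```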

bamfile: Alignment file produced by any mapper. This file must contain all chimerically mapped reads and may contain also linearly mapped reads.

bedfile:

Chr	Start	End	Name	Strand
1	67092175	67093604	NR_075077_exon_0_0_chr1_67092176_r	-
1	67096251	67096321	NR_075077_exon_1_0_chr1_67096252_r	-
1	67103237	67103382	NR_075077_exon_2_0_chr1_67103238_r	-

Normal BED file in BED6 format. The name should contain a gene name or gene ID and the exon_number. You can specify how the name should be processed using -p (platform), -s (character used to separate name and exon number) and -e (exon_index).

Output produced by FUCHS

hek293.alternative_splicing.txt:

This file summarizes the relationship of different circRNAs derived from the same host-gene.

Transcript	circles	same_start	same_end	overlapping	within
NM_016287	1:20749723-20773610	.	.	.	.
NM_005095	1:35358925-35361789,1:35381259-35389082,1:35381259-35390098	1:35381259-35389082|1:35381259-35390098,	.	.	.
NM_001291940	1:236803428-236838599,1:236806144-236816543	.	.	.	1:236803428-236838599|1:236806144-236816543,

Transcript: Transcript name as defined by the bed-annotation file

circles: Comma-separated list of circRNA ids derived from this transcript

same_start: Comma-separated list of circRNA pairs separated by |. Pairs in this column share the same start coordinates. A "." indicates that there are no circle pairs that share the same start coordinates.

same_end: Same as same_start, except that circle pairs share the same end coordinates.

overlapping: Comma-separated list of circRNA pairs separated by |. Pairs in this column share neither start nor end coordinates, but their relation is such that: start.x < start.y && end.x < end.y && start.y < end.x

within: Same as overlapping, except that circle pairs have the following relation: start.x < start.y && end.x > end.y
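
The four relations can be written down as a small classifier. This is a minimal sketch of the conditions as stated above, assuming Python 3; it is not the FUCHS implementation, the function name classify_pair is hypothetical, and the labels identical and disjoint are additions for pairs that fall into none of the four columns.

```python
def classify_pair(x, y):
    """Classify two circles, each given as (start, end), by their coordinate relation."""
    # Order the pair so that x starts first (ties broken by end coordinate).
    x, y = sorted([x, y])
    if x == y:
        return "identical"
    if x[0] == y[0]:
        return "same_start"
    if x[1] == y[1]:
        return "same_end"
    if x[1] > y[1]:
        return "within"  # start.x < start.y and end.x > end.y
    if y[0] < x[1]:
        return "overlapping"  # start.x < start.y, end.x < end.y, start.y < end.x
    return "disjoint"


# Coordinates taken from the NM_005095 and NM_001291940 example rows.
print(classify_pair((35381259, 35389082), (35381259, 35390098)))
print(classify_pair((236803428, 236838599), (236806144, 236816543)))
```

For the two example pairs this yields same_start and within, matching the columns in which they appear in the table.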

hek293.exon_counts.bed: This file is a bed-formatted file that describes the exon-structure and can be loaded into any genome browser. Each line corresponds to a circRNA.

Chr	Circle Start	Circle End	Transcript	Num of Reads	Strand	Start	End	Color	Num of Exon	Exon Lengths	Relative Exon Starts
chr1	35358925	35361789	NM_005095	9	+	35358925	35361789	0,255,0	3	521,61,170	0,2269,2694
chr1	20749723	20773610	NM_016287	4	-	20749723	20773610	0,255,0	4	159,90,143,159	0,7443,21207,23728

Chr: Chromosome of circRNA

Circle Start: The 5' site of the chimeric junction. This is relative to the reference strand, i.e. start < end! The location is 1-index based

Circle End: The 3' site of the chimeric junction. This is relative to the reference strand, i.e. start < end! The location is 0-index based

Transcript: Transcript name as defined by the bed-annotation file

Num of Reads: Number of reads supporting this chimeric junction, in other words, reads that are chimerically mapped to this junction

Strand: Strand of the host-gene

Start: Copy of Circle Start, to conform to the BED12 format

End: Copy of Circle End, to conform to the BED12 format

Color: Predefined color in which the exons will show up in the genome viewer (0,255,0 -> green)

Num of Exon: Number of exons this circRNA consists of

Exon Lengths: Comma-separated list of the length of each exon

Relative Exon Starts: Comma-separated list of the relative starting positions of the exons within the circle boundaries.
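
Exon Lengths and Relative Exon Starts follow the usual BED12 block convention, so absolute exon intervals can be recovered by adding each relative start to Circle Start. The following minimal sketch, assuming Python 3, shows this for the first example row; the function name exon_intervals is hypothetical and not part of FUCHS.

```python
def exon_intervals(bed_line):
    """Return absolute (start, end) pairs for the exons of one exon_counts.bed line."""
    fields = bed_line.rstrip("\n").split("\t")
    circle_start = int(fields[1])
    lengths = [int(v) for v in fields[10].split(",") if v]
    rel_starts = [int(v) for v in fields[11].split(",") if v]
    # Each exon begins at circle_start + relative start and extends by its length.
    return [(circle_start + rel, circle_start + rel + length)
            for rel, length in zip(rel_starts, lengths)]


row = ("chr1\t35358925\t35361789\tNM_005095\t9\t+\t35358925\t35361789"
       "\t0,255,0\t3\t521,61,170\t0,2269,2694")
print(exon_intervals(row))
# [(35358925, 35359446), (35361194, 35361255), (35361619, 35361789)]
```

The last interval ends at Circle End, which is a useful sanity check when reading these files.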

hek293.exon_counts.txt: This file contains similar information to the previous file, but with more detailed information on the exons. Each line corresponds to one exon.

sample	circle_id	transcript_id	other_ids	exon_id	chr	start	end	strand	exon_length	unique_reads	fragments	number+	number-
hek293	1:35358925-35361789	NM_005095	NM_005095	2	1	35358924	35359446	+	522	9	9	4	5
hek293	1:35358925-35361789	NM_005095	NM_005095	3	1	35361193	35361255	+	62	3	3	1	2
hek293	1:35358925-35361789	NM_005095	NM_005095	4	1	35361618	35361789	+	171	9	9	4	5
hek293	1:20749723-20773610	NM_016287	NM_016287	3	1	20749722	20749882	-	160	4	4	4	0
hek293	1:20749723-20773610	NM_016287	NM_016287	4	1	20757165	20757256	-	91	1	1	1	0
hek293	1:20749723-20773610	NM_016287	NM_016287	5	0	0	0	0	0	0	0	0	0
hek293	1:20749723-20773610	NM_016287	NM_016287	6	0	0	0	0	0	0	0	0	0
hek293	1:20749723-20773610	NM_016287	NM_016287	7	1	20770929	20771073	-	144	1	1	1	0
hek293	1:20749723-20773610	NM_016287	NM_016287	8	1	20773450	20773610	-	160	4	4	4	0

sample: Sample name as specified by the user. This is useful if the user wants to merge files from different samples

circle_id: circRNA-ID. The circle ID is formatted so that it can be copied and pasted into a genome browser for easy access

transcript_id: Transcript name as defined by the bed-annotation file. This is the best fitting transcript, i.e. the splice variant that contains the most exons that are actually covered

other_ids: Alternative transcript names that are either just as fitting, or contain more or fewer exons than are supported by reads

exon_id: Exon number relative to the host-gene of the circularized exon. One circle may have more than one exon. These will be listed as consecutive lines

chr: Chromosome the circRNA is located on

start: 5' start of the exon, relative to the reference strand, 0-based

end: 3' end of the exon, relative to the reference strand, 0-based

strand: Strand of the host-gene

exon_length: Length of the current exon

unique_reads: Number of unique reads associated with the chimeric junction. When the data is paired end, then both ends are considered as separate reads.

fragments: Number of broken fragments aligning to the circle

number+: Number of reads spanning the chimeric junction on the forward strand

number-: Number of reads spanning the chimeric junction on the reverse strand (if reads come from only one strand, this could indicate a sequencing bias)
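
The number+ and number- columns can be turned into a per-exon forward-strand fraction to spot circles supported from one strand only. This is a minimal sketch, assuming Python 3 and a tab-separated file with the header shown above; the function name strand_fractions is hypothetical, and skipping rows without any supporting reads (such as the all-zero rows in the example) is a choice made here, not FUCHS behaviour.

```python
import csv
import io


def strand_fractions(text):
    """Return {(circle_id, exon_id): fraction of junction reads on the forward strand}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    result = {}
    for row in reader:
        plus, minus = int(row["number+"]), int(row["number-"])
        total = plus + minus
        if total == 0:
            continue  # no supporting reads for this exon row
        result[(row["circle_id"], row["exon_id"])] = plus / total
    return result


header = ("sample\tcircle_id\ttranscript_id\tother_ids\texon_id\tchr\tstart\tend"
          "\tstrand\texon_length\tunique_reads\tfragments\tnumber+\tnumber-\n")
rows = (
    "hek293\t1:35358925-35361789\tNM_005095\tNM_005095\t2\t1\t35358924\t35359446\t+\t522\t9\t9\t4\t5\n"
    "hek293\t1:20749723-20773610\tNM_016287\tNM_016287\t3\t1\t20749722\t20749882\t-\t160\t4\t4\t4\t0\n"
    "hek293\t1:20749723-20773610\tNM_016287\tNM_016287\t5\t0\t0\t0\t0\t0\t0\t0\t0\t0\n"
)
for key, fraction in strand_fractions(header + rows).items():
    print(key, round(fraction, 2))
```

A fraction of exactly 0 or 1 marks exons whose junction reads all come from a single strand.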

hek293.mate_status.txt: This output file reports how often each fragment spans a chimeric junction. A fragment can span the chimeric junction once (single: only one end spans the junction), twice (double: both ends span the chimeric junction), or more than twice (undefined).

circle_id	transcript_ids	num_reads	min_length	max_length	single	double	undefined
1_20749723_20773610	NM_016287	4	790	790	4	0	0
1_35358925_35361789	NM_005095	9	754	754	9	0	0

circle_id:

transcript_ids:

num_reads:

min_length:

max_length:

single:

double:

undefined:
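
As an illustration of how the single, double and undefined counts might be used, the following minimal sketch (assuming Python 3) computes the share of double-breakpoint fragments for one row. The function name double_fraction is hypothetical and not part of FUCHS, and the column positions are taken from the header line of the example table.

```python
def double_fraction(line):
    """Return (circle_id, fraction of fragments spanning the junction with both ends)."""
    fields = line.rstrip("\n").split("\t")
    circle_id = fields[0]
    # Columns 6 to 8 of the example table: single, double, undefined.
    single, double, undefined = (int(v) for v in fields[5:8])
    total = single + double + undefined
    return circle_id, (double / total if total else 0.0)


print(double_fraction("1_20749723_20773610\tNM_016287\t4\t790\t790\t4\t0\t0"))
```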

hek293.skipped_exons.bed:

Chr	Circle-Start	Circle-End	Transcript	Ratio	Strand	Intron-Start	Intron-End	Color	NumExon-2	IntronLength	RelativeStart
chr5	178885614	178931326	NM_030613	60.0	.	178913072	178931236	255,0,0	3	1,146,1	0,30950,45711
chr6	161034259	161049979	NM_001291958	40.0	.	161049332	161049852	255,0,0	3	1,520,1	0,15073,15719

hek293.skipped_exons.txt:

circle_id	transcript_id	skipped_exon	intron	read_names	splice_reads	exon_reads
5_178885614_178931326	NM_030613	5:178916564-178916710	set([('5', 178913072, 178931236)])	MISEQ:136:000000000-ACBC6:1:2103:10044:24618,MISEQ:136:000000000-ACBC6:1:2115:19571:6931,MISEQ:136:000000000-ACBC6:1:1119:25537:8644	3	5
6_161034259_161049979	NM_001291958	6:161049332-161049852	set([('6', 161049332, 161049852)])	MISEQ:136:000000000-ACBC6:1:1113:25288:9067,MISEQ:136:000000000-ACBC6:1:2116:11815:3530	2	5
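
The intron column of skipped_exons.txt holds a Python set representation, which can be unpacked with a regular expression. This is a minimal sketch, assuming Python 3; the function name parse_skipped_row is hypothetical and not part of FUCHS. The ratio computed here (100 * splice_reads / exon_reads) is an assumption: it reproduces the Ratio values 60.0 and 40.0 of the two example rows in skipped_exons.bed, but the source does not state the formula. The read names in the usage example are shortened placeholders.

```python
import re

# Matches entries such as ('6', 161049332, 161049852) inside set([...]).
INTRON_RE = re.compile(r"\('([^']+)',\s*(\d+),\s*(\d+)\)")


def parse_skipped_row(line):
    """Parse one data line of skipped_exons.txt into a dictionary."""
    (circle_id, transcript_id, skipped_exon, intron_repr,
     read_names, splice_reads, exon_reads) = line.rstrip("\n").split("\t")
    introns = [(c, int(s), int(e)) for c, s, e in INTRON_RE.findall(intron_repr)]
    splice, exon = int(splice_reads), int(exon_reads)
    return {
        "circle_id": circle_id,
        "transcript_id": transcript_id,
        "skipped_exon": skipped_exon,
        "introns": introns,
        "reads": read_names.split(","),
        # Assumed definition of the ratio; None when no exon reads are present.
        "ratio": 100.0 * splice / exon if exon else None,
    }


row = ("6_161034259_161049979\tNM_001291958\t6:161049332-161049852"
       "\tset([('6', 161049332, 161049852)])\tread_a,read_b\t2\t5")
parsed = parse_skipped_row(row)
print(parsed["introns"], parsed["ratio"])
```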

hek293_exon_chain_inferred_12.bed:

hek293_exon_chain_inferred_6.bed

hek293:

1_35358925_35361789_9reads.sorted.bam 1_35358925_35361789_9reads.sorted.bam.bai 1_20749723_20773610_4reads.sorted.bam 1_20749723_20773610_4reads.sorted.bam.bai

hek293.coverage_pictures:

1_35358925_35361789_NM_005095.png 1_20749723_20773610_NM_016287.png cluster_means_all_circles.png

hek293.coverage_profiles:

1_35358925_35361789.NM_005095.txt 1_20749723_20773610.NM_016287.txt coverage.clusters.all_circles.pdf coverage_profiles.all_circles.pdf
