Skip to content

FUCHS - FUll circle CHaracterization from rna-Seq

License

Notifications You must be signed in to change notification settings

daaaaande/FUCHS

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FUCHS - FUll circular RNA CHaracterization from RNA-Seq

FUCHS is a python pipeline designed to fully characterize circular RNAs. It uses a list of circular RNAs and reads spanning the back-splice junction as well as a BAM file containing the mapping of all reads (alternatively of all chimeric reads).

The reads from one circle are extracted by FUCHS and saved in an individual BAM file. Based on these BAM files, FUCHS will detect alternative splicing within the same circle boundaries, summarize different circular isoforms from the same host-gene and generates coverage plots for each circle. It will also cluster circles based on their coverage profile. These results can be used to identify potential false positive circles.

Installation

FUCHS dependes on bedtools (>= 2.25.0) and samtools (>= 1.3.1) and Python (> 2.7; pysam>=0.9.1.4, pybedtools>=0.7.8, numpy>=1.11.2, pathos>=0.2.1) and R(> 3.2.0; amap, Hmisc, gplots). All Python an R dependencies will be installed automatically when installing FUCHS. Please make sure to have the correct versions of bedtools and samtools in your $PATH.

Clone the repository and install FUCHS using setup.py:

$ git clone git@github.com:dieterich-lab/FUCHS.git

$ cd FUCHS

$ python setup.py install --user

# This will install a FUCHS binary in $HOME/.local/bin/
# make sure this folder is in your $PATH

# Check the installation:

$ FUCHS --help

Usage

To characterize circRNAs from RNA-seq data you have to:

  1. Map RNAseq data from quality checked fastq files with either STAR , BWA, TopHat-Fusion.
  2. Detect circRNAs using DCC, CIRI, CIRCfinder or CIRCexplorer depending on the program you used for mapping.
  3. Run FUCHS (right now only the combination STAR + DCC has been tested; other setups are under development)

Step by step tutorial

In this tutorial we will be using HEK293 data available in this repository and use STAR with DCC to detect circular RNAs

1. Mapping of RNA-Seq data

Map RNA-seq data with STAR (Dobin et al., 2013). Note that --alignSJoverhangMin and --chimJunctionOverhangMin should use the same value, to make the circRNA expression and linear gene expression level comparable. Note that STARlong is not mapping chimeric reads correctly.

  • Note: The joined pair mapping should be run first. If the data are paired end, two additional separate mate mappings are recommended This step is not mandatory, but will increase the sensitivity of DCC detection, because it collect small circRNAs which appear with one chimeric junction point at each read mate. If the data is single end, only one mapping step is needed. In this case, PE sequencing data was used.
$ STAR --readFilesCommand zcat --runThreadN 18
       --genomeDir [genome]
       --outSAMtype BAM SortedByCoordinate
       --readFilesIn [sample]_1.fastq.gz ([sample]_2.fastq.gz)
       --outFileNamePrefix [sample]
       --quantMode GeneCounts
       --genomeLoad NoSharedMemory
       --outReadsUnmapped Fastx
       --outSJfilterOverhangMin 15 15 15 15
       --alignSJoverhangMin 15
       --alignSJDBoverhangMin 10
       --outFilterMultimapNmax 20
       --outFilterScoreMin 1
       --outFilterMismatchNmax 999
       --outFilterMismatchNoverLmax 0.05
       --outFilterMatchNminOverLread 0.7
       --alignIntronMin 20
       --alignIntronMax 1000000
       --alignMatesGapMax 1000000
       --chimSegmentMin 15
       --chimScoreMin 15
       --chimScoreSeparation 10
       --chimJunctionOverhangMin 15
       --twopassMode Basic
       --alignSoftClipAtReferenceEnds No
       --outSAMattributes NH HI AS nM NM MD jM jI XS
       --sjdbGTFfile [annotation].gtf
1.1. Mates separate mapping (optional for PE data)

Note: the mate assignments should be consistent throughout the mapping and circular RNA detection process. In the following case, SamplePairedRead_1.fastq.gz is the first mate which also was the first mate in the STAR call.

# remap unmapped reads as single end to obtain double breakpoint fragments

$ gzip sample/Unmapped.out.mate1
$ mv sample/Unmapped.out.mate1.gz sample/Unmapped_out_mate1.fastq.gz
$ STAR --readFilesCommand zcat --runThreadN 18 --genomeDir [genome] --outSAMtype BAM SortedByCoordinate --readFilesIn [sample]/Unmapped_out_mate1.fastq.gz --outFileNamePrefix [sample].mate1.  --quantMode GeneCounts --genomeLoad NoSharedMemory --outReadsUnmapped Fastx --outSJfilterOverhangMin 15 15 15 15 --alignSJoverhangMin 15 --alignSJDBoverhangMin 10 --outFilterMultimapNmax 20 --outFilterScoreMin 1   --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.05 --outFilterMatchNminOverLread 0.7 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000  --chimSegmentMin 15  --chimScoreMin 15   --chimScoreSeparation 10  --chimJunctionOverhangMin 15 --twopassMode Basic --alignSoftClipAtReferenceEnds No --outSAMattributes NH HI AS nM NM MD jM jI XS  --sjdbGTFfile [annotation].gtf

$ gzip sample/Unmapped.out.mate2
$ mv sample/Unmapped.out.mate2.gz sample/Unmapped_out_mate2.fastq.gz
$ STAR --readFilesCommand zcat --runThreadN 18 --genomeDir [genome] --outSAMtype BAM SortedByCoordinate --readFilesIn [sample]/Unmapped_out_mate2.fastq.gz --outFileNamePrefix [sample].mate2.  --quantMode GeneCounts --genomeLoad NoSharedMemory --outReadsUnmapped Fastx --outSJfilterOverhangMin 15 15 15 15 --alignSJoverhangMin 15 --alignSJDBoverhangMin 10 --outFilterMultimapNmax 20 --outFilterScoreMin 1   --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.05 --outFilterMatchNminOverLread 0.7 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000  --chimSegmentMin 15  --chimScoreMin 15   --chimScoreSeparation 10  --chimJunctionOverhangMin 15 --twopassMode Basic --alignSoftClipAtReferenceEnds No --outSAMattributes NH HI AS nM NM MD jM jI XS  --sjdbGTFfile [annotation].gtf

2. Detection of circRNAs from chimeric.out.junction files with DCC

Acquiring suitable GTF files for repeat masking
  • It is strongly recommended to specify a repetitive region file in GTF format for filtering.
  • A suitable file can for example be obtained through the UCSC table browser . After choosing the genome, a group like Repeats or Variation and Repeats has to be selected. For the track, we recommend to choose RepeatMasker together with Simple Repeats and combine the results afterwards.
  • Note: the output file needs to comply with the GTF format specification. Additionally it may be the case that the names of chromosomes from different databases differ, e.g. 1 for chromosome 1 from ENSEMBL compared to chr1 for chromosome 1 from UCSC. Since the chromosome names are important for the correct functionality of DCC a sample command for converting the identifiers may be sed -i 's/^chr//g' your_repeat_file.gtf
Preparation of files containing the paths to required chimeric.out.junction files
  • samplesheet file, containing the paths to the jointly mapped chimeric.out.junction files
$ cat samplesheet
/path/to/STAR/sample/joint_mapping/chimeric.out.junction
  • mate1 file, containing the paths to chimeric.out.junction files of the separately mapped first read of paired-end data
$ cat mate2
/path/to/STAR/sample.mate1/joint_mapping/chimeric.out.junction
  • mate2 file, containing the paths to chimeric.out.junction files of the separately mapped first read of paired-end data
$ cat mate2
/path/to/STAR/sample.mate2/joint_mapping/chimeric.out.junction
Running DCC

After performing all preparation steps DCC can now be started:

# Call DCC to detect circRNAs, using HEK293 data as example.

$ DCC @samplesheet \ # @ is generally used to specify a file name
      -mt1 @mate1 \ # mate1 file containing the mate1 independently mapped chimeric.junction.out files
      -mt2 @mate2 \ # mate2 file containing the mate1 independently mapped chimeric.junction.out files
      -D \ # run in circular RNA detection mode
      -R [Repeats].gtf \ # regions in this GTF file are masked from circular RNA detection
      -an [Annotation].gtf \ # annotation is used to assign gene names to known transcripts
      -Pi \ # run in paired independent mode, i.e. use -mt1 and -mt2
      -F \ # filter the circular RNA candidate regions
      -M \ # filter out candidates from mitochondrial chromosomes
      -Nr 2 2 \ minimum number of replicates the candidate is showing in [1] and minimum count in the replicate [2]
      -fg \ # candidates are not allowed to span more than one gene
      -G \ # also run host gene expression
      -A [Reference].fa \ # name of the fasta genome reference file; must be indexed, i.e. a .fai file must be present

# For details on the parameters please refer to the help page of DCC:
$ DCC -h

Notes:

  • By default, DCC assumes that the data is stranded. For non-stranded data the -N flag should be used.
  • Although not mandatory, we strongly recommend to the -F filtering step
Output files generated by DCC

The output of DCC consists of the following four files: CircRNACount, CircCoordinates, LinearCount and CircSkipJunctions.

  • CircRNACount: a table containing read counts for circRNAs detected. First three columns are chr, circRNA start, circRNA end. From fourth column on are the circRNA read counts, one sample per column, shown in the order given in your samplesheet.
  • CircCoordinates: circular RNA annotations in BED format. The columns are chr, start, end, genename, junctiontype (based on STAR; 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT), strand, circRNA region (startregion-endregion), overall regions (the genomic features circRNA coordinates interval covers).
  • LinearCount: host gene expression count table, same setup with CircRNACount file.
  • CircSkipJunctions: circSkip junctions. The first three columns are the same as in LinearCount/CircRNACount, the following columns represent the circSkip junctions found for each sample. circSkip junctions are given as chr:start-end:count, e.g. chr1:1787-6949:10. It is possible that for one circRNA multiple circSkip junctions are found due to the fact the the circular RNA may arise from different isoforms. In this case, multiple circSkip junctions are delimited with semicolon. A 0 implies that no circSkip junctions have been found for this circRNA.

3. Prepare input data for FUCHS

The files chimeric.sam, mate1.chimeric.sam, and mate2.chimeric.sam files for FUCHS have to be merged (not necessary if circles were detected using BWA/CIRI)

# convert SAM to BAM
$ samtools view -Sb -o sample sample/Chimeric.out.sam
$ samtools view -Sb -o sample.1 sample.1/Chimeric.out.sam
$ samtools view -Sb -o sample.2 sample.2/Chimeric.out.sam

# sort both BAM files
$ samtools sort -o sample.sorted.bam sample.bam
$ samtools sort -o sample.1.sorted.bam sample.1.bam
$ samtools sort -o sample.2.sorted.bam sample.2.bam

# create an index for both BAM files
$ samtools index sample.sorted.bam
$ samtools index sample.1.sorted.bam
$ samtools index sample.2.sorted.bam

# merge both mate BAM files into one new BAM file
$ samtools merge merged_sample.bam sample.sorted.bam sample.1.sorted.bam sample.2.sorted.bam

# re-index the newly aggregated BAM file
$ samtools index merged_sample.bam

4. Running FUCHS

Run FUCHS to start the pipeline which will extract reads, check mate status, detect alternative splicing events, classify different isoforms, generate coverage profiles and cluster circRNAs based on coverage profiles

# using STAR/DCC Input
$ FUCHS -r 2 -q 2 -p ensembl -e 2 -T ~/tmp
        -D CircRNACount
        -J sample/Chimeric.out.junction
        -F sample.1/Chimeric.out.junction
        -R sample.2/Chimeric.out.junction.fixed
        -B merged_sample.sorted.bam
        -A [annotation].bed
        -N sample

# if BWA/CIRI was used, use -C to specify the circIDS list (omit -D, -J, -F and -R)
# For details on the parameters please refer to the help page of FUCHS:
$ FUCHS --help

5. Optional FUCHS modules

Run the additional module guided_denovo_circle_structure_parallel.py to obtain a more refined circle reconstruction based on intron signals. The circRNA seperated bamfiles (step 2) are the only input needed for the module. If you supply an annotation file, unsupported exons will be reported with a score of 0, if you do not supply an annotation file, unsupported will not be reported.

$ guided_denovo_circle_structure_parallel -c 18 -A [annotatation].bed -I FUCHS/output/folder -N sample

# FUCHS/output/folder corresponds to the output directory of the FUCHS pipeline
# sample corresponds to your sample name, just as specified for the pipeline

That's all folks

Required input data

circIDs:

circID read1,read2,read3
1:3740233|3746181 MISEQ:136:000000000-ACBC6:1:2107:10994:20458,MISEQ:136:000000000-ACBC6:1:1116:13529:8356
1:8495063|8614686 MISEQ:136:000000000-ACBC6:1:2118:9328:9926

The first column contains the circle id formated as folllowed chr:start|end. The second column is a comma separated list of read names spanning the back-splice junction.

bamfile: Alignment file produced by any mapper. This file must contain all chimerically mapped reads and may contain also linearly mapped reads.

bedfile:

Chr Start End Name Score Strand
1 67092175 67093604 NR_075077_exon_0_0_chr1_67092176_r 0 -
1 67096251 67096321 NR_075077_exon_1_0_chr1_67096252_r 0 -
1 67103237 67103382 NR_075077_exon_2_0_chr1_67103238_r 0 -

Normal BED file in BED6 format. The name should contain a gene name or gene ID and the exon_number. You can specify how the name should be processed using -p (platform), -s (character used to separate name and exon number) and -e (exon_index).

Output produced by FUCHS

hek293.alternative_splicing.txt:

This file summarizes the relationship of different circRNAs derived from the same host-gene.

Transcript circles same_start same_end overlapping within
NM_016287 1:20749723-20773610 . . . .
NM_005095 1:35358925-35361789,1:35381259-35389082,1:35381259-35390098 1:35381259-35389082|1:35381259-35390098, . . .
NM_001291940 1:236803428-236838599,1:236806144-236816543 . . . 1:236803428-236838599|1:236806144-236816543,
Transcript: Transcript name as defined by the bed-annotation file
circles: Comma-separated list of circRNA ids derived from this transcript
same_start: Comma-seprated list of circRNA pairs separated by |. Pairs in this column share the same start coordinates. A "." indicates that there are no circle pairs that share the same start coordinates.
same_end: Same as same_start, only now, circle pairs share the same end coordinates.
overlapping: Comma-seprated list of circRNA pairs separated by |. Pairs in this column share neither start nor end coordinates, but their relation is such that: start.x < start.y && end.x < end.y && start.y < end.x
within: Same as overlapping, only now, circle pairs have the follwoing relation: start.x < start.y && end.x > end.y

hek293.exon_counts.bed: This file is a bed-formatted file that describes the exon-structure and can be loaded into any genome browser. Each line corresponds to a circRNA.

Chr Circle Start Circle End Transcript Num of Reads Strand Start End Color Num of Exon Exon Lengths Relative Exon Starts
chr1 35358925 35361789 NM_005095 9 + 35358925 35361789 0,255,0 3 521,61,170 0,2269,2694
chr1 20749723 20773610 NM_016287 4 - 20749723 20773610 0,255,0 4 159,90,143,159 0,7443,21207,23728
Chr: Chromosome of circRNA
Circle Start: The 5' site of the chimeric junction. This is relative to the reference strand, i.e. start < end! The location is 1-index based
Cirlce End: The 3' site of the chimeric junction. This is relative to the reference strand, i.e. start < end! The location is 0-index based
Transcript: Transcript name as defined by the bed-annotation file
Num of Reads : Number of reads supporting this chimeric junction, in other words, reads that are chimerically mapped to this junction
Strand: Strand of the host-gene
Start: Copied Circle Start to stay conform with BED12 format
End: Copied Circle End to stay conform with BED12 format
Color: pre defined color the exons will show up in the genome viewer (0,255,0 -> green)
Num of Exon: Number of exons in this circRNA consists of
Exon Lengths: Comma-seprated list of the length of each exon
Relative Exon Starts: Comma-separated list of the relative starting positions of the exons within the circle boundaries.

hek293.exon_counts.txt: This file contains similar information as the previous file, just more detailed inforamtion on the exons. Each line corresponds to one exon.

sample circle_id transcript_id other_ids exon_id chr start end strand exon_length unique_reads fragments number+ number-
hek293 1:35358925-35361789 NM_005095 NM_005095 2 1 35358924 35359446 + 522 9 9 4 5
hek293 1:35358925-35361789 NM_005095 NM_005095 3 1 35361193 35361255 + 62 3 3 1 2
hek293 1:35358925-35361789 NM_005095 NM_005095 4 1 35361618 35361789 + 171 9 9 4 5
hek293 1:20749723-20773610 NM_016287 NM_016287 3 1 20749722 20749882 - 160 4 4 4 0
hek293 1:20749723-20773610 NM_016287 NM_016287 4 1 20757165 20757256 - 91 1 1 1 0
hek293 1:20749723-20773610 NM_016287 NM_016287 5 0 0 0 0 0 0 0 0 0
hek293 1:20749723-20773610 NM_016287 NM_016287 6 0 0 0 0 0 0 0 0 0
hek293 1:20749723-20773610 NM_016287 NM_016287 7 1 20770929 20771073 - 144 1 1 1 0
hek293 1:20749723-20773610 NM_016287 NM_016287 8 1 20773450 20773610 - 160 4 4 4 0
sample: Sample name as specified by the user. This is useful if the user wants to merge files from different samples
circle_id: circRNA-ID. The circleID is formatted to be copy and pasted to a genome browser for easy access
transcript_id: Transcript name as defined by the bed-annotation file. This is the best fitting transcript. i.e. the splicing variants that contains the most exons that are actually covered
other_ids: Alternative Transcript names that are either just as fitting, or contain more or less exons as supported by reads
exon_id: Exon number relative to the host-gene of the circularized exon. One circle may have more than one exon. These will be listed as consecutive lines
chr: Chromosome the circRNA is located on
start: 5' start of the exon, relative to the reference strand, 0-based
end: 3' end of the exon, relative to the reference start, 0-based
strand: Strand of the host-gene
exon_length: Length of the current exon
unique_reads: Number of unique reads associated with the chimeric junction. When the data is paired end, then both ends are considered as separate reads.
fragments: Number of broken fragments aligning to the circle
number+: Number of reads spanning the chimeric junction on the forward strand
number-: Number of reads spanning the chimeric junction on the reverse strand (if reads are only from one strand, it could indicate, that there is a sequencing bias.)

hek293.mate_status.txt: This output file contains the results of analysing the amount of how often each fragment spans a chimeric junction. A fragment can either span the chimeric junction once (single), only one end spans the junction, twice (double) both ends span the chimeric junction, or more than twice (undefined).

circle_id transcript_ids num_reads min_length max_length single double undefined
1_20749723_20773610 NM_016287 4 790 790 4 0 0
1_35358925_35361789 NM_005095 9 754 754 9 0 0
circle_id:
transcript_ids:
num_reads:
min_length:
max_length:
single:
double:
undefined:

hek293.skipped_exons.bed:

Chr Circle-Start Circle-End Transcript Ratio Strand Intron-Start Intron-End Color NumExon-2 IntronLength RelativeStart
chr5 178885614 178931326 NM_030613 60.0 . 178913072 178931236 255,0,0 3 1,146,1 0,30950,45711
chr6 161034259 161049979 NM_001291958 40.0 . 161049332 161049852 255,0,0 3 1,520,1 0,15073,15719

hek293.skipped_exons.txt:

circle_id transcript_id skipped_exon intron read_names splice_reads exon_reads
5_178885614_178931326 NM_030613 5:178916564-178916710 set([('5', 178913072, 178931236)]) MISEQ:136:000000000-ACBC6:1:2103:10044:24618,MISEQ:136:000000000-ACBC6:1:2115:19571:6931,MISEQ:136:000000000-ACBC6:1:1119:25537:8644 3 5
6_161034259_161049979 NM_001291958 6:161049332-161049852 set([('6', 161049332, 161049852)]) MISEQ:136:000000000-ACBC6:1:1113:25288:9067,MISEQ:136:000000000-ACBC6:1:2116:11815:3530 2 5

hek293_exon_chain_inferred_12.bed:


hek293_exon_chain_inferred_6.bed


hek293:

1_35358925_35361789_9reads.sorted.bam 1_35358925_35361789_9reads.sorted.bam.bai 1_20749723_20773610_4reads.sorted.bam 1_20749723_20773610_4reads.sorted.bam.bai


hek293.coverage_pictures:

1_35358925_35361789_NM_005095.png 1_20749723_20773610_NM_016287.png cluster_means_all_circles.png


hek293.coverage_profiles:

1_35358925_35361789.NM_005095.txt 1_20749723_20773610.NM_016287.txt coverage.clusters.all_circles.pdf coverage_profiles.all_circles.pdf

About

FUCHS - FUll circle CHaracterization from rna-Seq

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 87.4%
  • R 12.4%
  • Shell 0.2%