FRAMA: From RNA-seq data to annotated mRNA assemblies


This software was developed at the Leibniz Institute on Aging - Fritz Lipmann
Institute (FLI) under a mixed licensing model. This
means that researchers at academic and non-profit organizations can use it for
free, while for-profit organizations are required to purchase a license. By
downloading the package you agree with conditions of the FLI Software License
Agreement for Academic Non-commercial Research (FLI-LICENSE).


FRAMA is a transcriptome assembly and mRNA annotation pipeline that utilizes external and newly developed software components. Starting with RNA-seq data and a reference transcriptome, FRAMA performs 4 steps:

1) de novo transcript assembly (Trinity),
2) gene symbol assignment (best bidirectional blastn hit),
3) fusion detection and scaffolding, and
4) contig annotation (CDS, mRNA boundaries).

Further details:

Bens M et al. FRAMA: from RNA-seq data to annotated mRNA assemblies.
BMC Genomics. 2016;17:54. doi:10.1186/s12864-015-2349-8.



All you need is a reference transcriptome in GenBank format and RNA-seq data in FastQ format. You can also provide orthologs to your reference transcripts from other species. The additional homologs are used for CDS inference.


FRAMA runs on Linux and is written in Perl (5.10.0), R (3.0.3) and GNU Make (3.81). FRAMA does not require any compilation, but relies on common bioinformatic applications to be installed. The installation of all external software packages might seem like a daunting task, but your package manager might bring you halfway through (see Installation).

Bioinformatic Software

Software        Requirement
Trinity         mandatory
samtools        mandatory
bamtools        mandatory
bowtie1         mandatory
bowtie2         mandatory
EMBOSS          mandatory
MAFFT           mandatory
GENSCAN         mandatory
RepeatMasker    optional
CD-HIT-EST      optional
TGICL           optional
WU-BLAST        mandatory

In case you do not use WU-BLAST:

Software        Requirement
NCBI-BLAST      mandatory
GenblastA       mandatory
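
Before running the pipeline, it can help to check which of the tools above are reachable through $PATH. The following is a minimal sketch, not part of FRAMA: the function name `missing_tools` is hypothetical, and the executable names in the example call are assumptions that may differ on your system.

```shell
#!/bin/sh
# Print every command from the argument list that is not found in $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names below are assumptions; adjust them to your installation.
missing_tools Trinity samtools bamtools bowtie bowtie2 mafft genscan
```

An empty output means every listed executable was found. Tools that are installed outside $PATH can instead be made known to FRAMA through PATH_ALL (see the configuration section).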

Perl Modules

Available via CPAN.

Module Version
BioPerl 1.006924
Parallel::ForkManager 0.7.5
Set::IntSpan 1.19
FileHandle::Unget 0.1628
Text::Soundex 3.05

Installation using cpanm:

cd setup && cpanm --installdeps .

R Packages

Package Version
plyr 1.8.3
ggplot2 1.0.1
reshape 0.8.5
gridExtra 2.0.0
annotate 1.44
GO.db 3.0
KEGG.db 3.0

Installation using FRAMA:

cd setup && Rscript --vanilla SETUP.R


In addition to FRAMA, you have to install all third-party tools marked as 'mandatory' in the table above. Depending on your Linux platform, your package manager might bring you halfway through (see Manual Installation / Automatic Installation).

Installing FRAMA is quick and easy. Download and unpack this repository and make sure to set the permission to execute FRAMA. You can add FRAMA to your $PATH or create a symlink to FRAMA in one of the directories in $PATH.

Here is a suggested workflow, which adds FRAMA to your $PATH:

chmod u+x FRAMA
# assumes the current directory is the unpacked repository
export PATH="$PATH:$PWD"
# run example
FRAMA example/testing.cfg
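
The symlink alternative mentioned above could look like the following sketch. The helper `link_frama` is hypothetical, and the target directory in the commented example ($HOME/bin) is an assumption; use any directory that is already listed in your $PATH.

```shell
#!/bin/sh
# link_frama SCRIPT BIN_DIR: symlink SCRIPT into BIN_DIR under its own name.
link_frama() {
    script=$1
    bin_dir=$2
    [ -x "$script" ] || { echo "not executable: $script" >&2; return 1; }
    mkdir -p "$bin_dir" || return 1
    # Resolve to an absolute path so the link stays valid from any directory.
    target=$(cd "$(dirname "$script")" && pwd)/$(basename "$script")
    ln -sf "$target" "$bin_dir/$(basename "$script")"
}

# Example (hypothetical target directory, assumed to be listed in $PATH):
# link_frama ./FRAMA "$HOME/bin"
```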

Manual Installation of external software

For instance, on Ubuntu (17.04):

sudo apt-get install perl default-jre r-base-core \
    ncbi-blast+ mafft emboss bowtie bowtie2 cd-hit \
    bamtools samtools parallel libc6-i386 build-essential \
    bioperl libparallel-forkmanager-perl libset-intspan-perl \
    libfilehandle-unget-perl r-cran-ggplot2 r-cran-plyr \
    r-cran-reshape lib32z1

Left to install manually:

  • Trinity, GENSCAN, Genblasta, RepeatMasker, TGICL
  • R packages: gridExtra, annotate, GO.db, KEGG.db

Automatic Installation of external software

On 64-bit platforms, FRAMA attempts to download and install (as non-root) missing software packages in a very naive way. This might fail due to different or missing library/compiler versions on your system.

Required prerequisites for automatic installation include at least the following (package names in parentheses are for Ubuntu 17.04, zesty):

cmake
zlib >= 1 (zlib1g-dev)
ncurses >= 5 (libncurses5-dev)
jre >= 1.7.0
g++-4.9 gcc-4.9
libc6 (libc6-i386)    # genscan, tgicl
lib32z1               # tgicl

Start automatic installation:

cd FRAMA/setup

GENSCAN must be downloaded manually, due to licence restrictions.

mkdir genscan && tar xvf genscanlinux.tar -C genscan
mv genscan $FRAMA_DIR/sources/.
ln -f -s $(readlink -f $FRAMA_DIR/sources/genscan/genscan) $FRAMA_DIR/bin/genscan
ln -f -s $(readlink -f $FRAMA_DIR/sources/genscan/) $FRAMA_DIR/bin/genscan.dir


Make sure all mandatory parameters are specified in the configuration file (see Configuration section). Then, call FRAMA with the appropriate configuration file.

FRAMA configuration_file

That's all. In case of aborts, consult logfiles and remove incomplete results. Rerunning the above command will complete remaining tasks.

Same as above, but shows all called processes.

FRAMA configuration_file verbose

Start from scratch (removes all created files beforehand).

FRAMA configuration_file scratch

FRAMA uses GNU make as a backbone. Parameters other than verbose, scratch, full-cleanup, cleanup are forwarded to make. For example, the following lists all tasks without executing them.

FRAMA configuration_file -n
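
As a sketch of how these calls can be combined, the wrapper below first lists the pending tasks with `-n` and only then starts the real run. The function `run_frama` and the `FRAMA_BIN` override are hypothetical names introduced for illustration; they are not part of FRAMA.

```shell
#!/bin/sh
# run_frama CONFIG [ARGS...]: dry-run first, then run for real.
# FRAMA_BIN is a hypothetical override for the executable name (default: FRAMA).
run_frama() {
    config=$1
    shift
    frama=${FRAMA_BIN:-FRAMA}
    [ -f "$config" ] || { echo "configuration file not found: $config" >&2; return 1; }
    # "-n" is forwarded to make and lists the tasks without executing them.
    "$frama" "$config" -n || return 1
    "$frama" "$config" "$@"
}

# Example: run_frama example/testing.cfg verbose
```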


FRAMA creates a lot of intermediate files. See "output files" for further information about each file. We provide two cleanup methods:

  1. full-cleanup: keeps important files

    FRAMA configuration_file full-cleanup

keeps the following files

  2. cleanup: keeps intermediate files for each transcript processing and trinity directory

    FRAMA configuration_file cleanup

additionally keeps:



Take a look at the provided example file PATH_TO_FRAMA/example/testing.conf and try to run it before running FRAMA on your own data set.

This also serves as a template for your custom configuration.

mandatory variables

The following depends mostly on your $PATH variable. Specify path to directories(!) of executables for each program that is not in your $PATH. Paths must be separated by space.

PATH_ALL := /home/user/src/cd-hit/ /home/user/src/EMBOSS/bin/
PATH_GENSCAN_MAT  := (point to Genscan Matrix to use)
PATH_BLASTDB := (point to Univector Database)

Indicate whether WU- or NCBI-BLAST should be used [0 WU, 1 NCBI].


Store intermediate and final files in the specified location. Make sure that enough space is available to store intermediate output (trinity, blast results, read alignments, ...).

OUTPUT_DIR := /data/output

Input reads in fastq format. In case of paired end data, indicate elements of pair by "R1" and "R2" in filename (Example: sampleA_R1.fq, sampleA_R2.fq). All files must be in the same format (one of fastq, fasta, gzipped).

READ_DIR := /data/reads/
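
The naming convention for paired-end data can be checked before a run. The sketch below assumes that the mate of a file is found by replacing the first "R1" in its name with "R2", as in the sampleA example; the function name `unpaired_reads` is hypothetical.

```shell
#!/bin/sh
# unpaired_reads DIR: print each file whose name contains "R1" but whose
# "R2" counterpart (same name with R1 replaced by R2) is missing from DIR.
unpaired_reads() {
    for r1 in "$1"/*R1*; do
        [ -e "$r1" ] || continue          # glob matched nothing
        name=$(basename "$r1")
        mate=$(printf '%s\n' "$name" | sed 's/R1/R2/')
        [ -e "$1/$mate" ] || printf '%s\n' "$name"
    done
}

# Example: unpaired_reads /data/reads/
```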

Reference transcriptome in GenBank format as provided by NCBI: -> [YOUR_REF_SPECIES] -> RNA/rna.gbk.gz


Specify taxonomy id of species to assemble. FRAMA connects to NCBI (once) to fetch necessary species information.

SPEC_TAXID := 458603
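
A quick way to catch forgotten settings is to look for `VAR := value` assignments in the configuration file. This is a minimal sketch with a hypothetical helper `missing_vars`; it only knows the variable names you pass to it, and the file name `my.cfg` in the commented example is a placeholder.

```shell
#!/bin/sh
# missing_vars CONFIG VAR...: print each VAR that has no non-empty
# "VAR := value" assignment in CONFIG.
missing_vars() {
    config=$1
    shift
    for var in "$@"; do
        grep -Eq "^[[:space:]]*${var}[[:space:]]*:=[[:space:]]*[^[:space:]]" "$config" \
            || printf '%s\n' "$var"
    done
}

# Only variables named in this section are listed here; a real configuration
# needs further mandatory variables.
# missing_vars my.cfg PATH_GENSCAN_MAT PATH_BLASTDB OUTPUT_DIR READ_DIR SPEC_TAXID
```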

We use genome wide annotation packages from Bioconductor to assign functional annotation to the resulting transcript catalogue. Provide (and install) the annotation package corresponding to your reference species.



If you already have extracted mRNA and CDS sequences in FASTA format, provide them to FRAMA. Additionally, you can add a repeat (soft) masked FASTA of your reference sequence in order to skip RepeatMasking step.

REF_TRANSCRIPTOME_FASTA            := /data/human_mRNA.fa
REF_TRANSCRIPTOME_FASTA_MASKED     := /data/human_mRNA.fa.masked
REF_TRANSCRIPTOME_FASTA_CDS        := /data/human_cds.fa
REF_TRANSCRIPTOME_FASTA_CDS_MASKED := /data/human_cds.fa.masked

CDS inference is based on the coding sequence of the orthologous reference transcript. You can extend the number of orthologs used to infer the appropriate CDS by providing a table with mappings between orthologous transcripts from different species. The first column must contain the accession of the reference transcript. Add one column for each species you want to use and use 'NA' to indicate unknown orthologs. Additionally, specify the taxonomy ID of each species in the first line (starting with #, tab separated). Keep in mind that we perform a multiple sequence alignment with all coding sequences. Therefore, the number of species used will have an influence on runtime. Additionally, you must provide a FASTA file containing all coding sequences mentioned in the table (ORTHOLOG_FASTA).

ORTHOLOG_TABLE := /data/ortholog_table.csv
ORTHOLOG_FASTA := /data/ortholog_cds.fa

Example content of ORTHOLOG_TABLE (also, take a look at example/ortholog_table.csv):

#9606   10090   10116   9615
NM_130786       NM_001081067    NM_022258       NA
NM_001198819    NM_001081074    NM_133400       XM_534776
NM_001198818    NM_001081074    NM_133400       XM_534776
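
The layout rules for the table (header line starting with #, tab-separated columns, one column per species) can be verified with a short awk check. This is only a sketch under those stated rules; the function name `check_ortholog_table` is hypothetical, and it does not verify that the accessions exist in ORTHOLOG_FASTA.

```shell
#!/bin/sh
# check_ortholog_table FILE: succeed if the first line starts with "#" and
# every row has the same number of tab-separated columns as that header.
check_ortholog_table() {
    awk -F '\t' '
        NR == 1 {
            if ($0 !~ /^#/) { print "line 1: header must start with #"; bad = 1 }
            ncol = NF
            next
        }
        NF != ncol {
            print "line " NR ": expected " ncol " columns, found " NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example: check_ortholog_table /data/ortholog_table.csv
```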

We keep a note in the GenBank output about the sequence name and species used to annotate the CDS. If multiple equally valid coding sequences are found, the first species in SPECIES_ORDER will be used. Please specify the order of columns (0-based) in ORTHOLOG_TABLE to indicate your preferred order of species. Example:


Specify the primary processing steps you want to apply to the raw trinity assembly (space separated list) in preferred order. Possible steps are: cd-hit and tgicl. Leave empty to skip primary processing.


Soft masks repeats in assembly and reference. Set to 0 if you want to skip repeat masking.


Software parameter

!Consult manual for external software!

Number of cpus. This will be used for any software which runs in parallel.


If SGE is available (qsub), it will be used for blast jobs. Specify number of jobs.



Single end (s) or paired end (pe) reads?


Consult trinity manual.

Added automatically: --no_cleanup

OPT_TRINITY   := --JM 10G --seqType fa


Repeat masking reference/assembly.

Added automatically: -xsmall -par OPT_CPUS

OPT_REPEAT_REF_TRANSCRIPTOME := -species human -engine ncbi
OPT_REPEAT_ASSEMBLY          := -species human -engine ncbi


Added automatically: -T OPT_CPUS



Added automatically: -c OPT_CPUS


misassembled contigs

Used to detect fusion transcripts. Specify maximum overlap (-max-overlap) between CDS regions (specifically: blast hits by coding sequences of reference transcriptome), minimum length of alignment (-min-frac-size), identity (-min-identity) and coverage (-min-coverage) thresholds.

OPT_FUSION := -max-overlap 5.0 -min-frac-size 200 -min-identity 70.0 -min-coverage 90.0


BLAST and GENBLASTA parameters, respectively.

Added automatically: -wordmask=seg lcmask -topcomboN 3 -cpus 1


SBH requirements

Specify minimum required identity and coverage to consider hit as SBH.

OPT_SBH := -identity=70.0 -coverage=30.0
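
To illustrate how such minimum thresholds act as a filter, the sketch below keeps only rows of a hit table that meet both minima. This is not FRAMA's implementation: the three-column input format (name, identity, coverage), the inclusive comparison, and the function name `filter_hits` are all assumptions made for the example.

```shell
#!/bin/sh
# filter_hits MIN_IDENTITY MIN_COVERAGE: read "name<TAB>identity<TAB>coverage"
# rows (a hypothetical format) on stdin and print rows meeting both minima.
filter_hits() {
    awk -F '\t' -v id="$1" -v cov="$2" '$2 >= id && $3 >= cov'
}

# Example with the thresholds shown above:
# filter_hits 70.0 30.0 < hits.tsv
```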


Specify minimum required identity and contig coverage of blast hit to consider contig as possible scaffolding fragment.

OPT_FRAGMENTS   := -identity 70.0 -query-coverage 90.0

Specify the minimum overlap between fragments in the alignment required to apply filtering rules (example: if the overlap exceeds 66% of contig length and the fragments differ over 98% in the overlap, the sequence with higher similarity to the reference is kept).

OPT_SCAFFOLDING := -fragment-overlap 66.0 -fragment-identity 98.0

CDS prediction

Add '-predictions' if you don't want to use predicted coding sequences (XM Accessions) for CDS inference. Don't use if your reference contains "XM" Accessions [TODO].

OPT_PREDICTCDS := -predictions

Output files

important files

File                     Description
transcriptome.gbk *      GenBank file describing all annotated sequences.
transcriptome_CDS.fa     Fasta with coding sequences.
transcriptome_mRNA.fa    Fasta with transcript sequences (w/o introns; clipped ends).
transcriptome_CDS.csv    Coordinates of CDS for mRNA sequences.
assembly_pripro.fa       Trinity assembly after primary processing.
annotation.pdf           General overview of transcript catalogue.
annotation.csv           Table containing summary for each annotated transcript.

*mRNA feature instead of 'gene' feature to limit mRNA boundaries in case of misassembled contigs
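
After a run, the presence of these files can be checked with a few lines of shell. The sketch assumes that the files sit directly in the directory passed as argument, which may not match the actual layout below OUTPUT_DIR; the function name `missing_outputs` is hypothetical.

```shell
#!/bin/sh
# missing_outputs DIR: print each expected result file that is absent or empty.
missing_outputs() {
    for f in transcriptome.gbk transcriptome_CDS.fa transcriptome_mRNA.fa \
             transcriptome_CDS.csv assembly_pripro.fa annotation.pdf annotation.csv; do
        [ -s "$1/$f" ] || printf '%s\n' "$f"
    done
}

# Example: missing_outputs /data/output
```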

functional annotations (based on reference)

Table containing GO Terms associated with each annotated transcript. Also, overview of covered GO Terms and genes in total (genes_per_ontology) and in more detail (genes_per_path).


Same as above, but for KEGG Pathways.


intermediate output


Trinity output (including intermediates).


Running FRAMA creates a lot of intermediate output which might come in handy in downstream analysis. Each transcript assignment is stored in a separate directory in


with the naming pattern according to assigned ortholog.


This directory includes the following files:

Result in GenBank format.


Raw GENSCAN output.


Assignment of transcript accession to GENSCAN prediction based on blast hits.


Multiple sequence alignment with orth. species requested in ORTHOLOG_TABLE


BLAST databases for reference and assembly.


BLAST results including average for each HSP-group (avg_*) and best hit per query (best_*).