Skip to content
florianerhard edited this page Jan 16, 2019 · 37 revisions

Table of Contents

Description

PRICE (Probabilistic inference of codon activities by an EM algorithm) is a method to identify ORFs using Ribo-seq experiments embedded in a pipeline for data analysis

1. PRICE is a method to identify ORFs

PRICE is a novel method to identify actively translated codons and ORFs using Ribosome profiling (Ribo-seq) experiments. For this purpose, previously all reads from a specific class (e.g. 29bp long reads) were mapped to the codon at a specific position within the read (e.g. position 12). Classes and positions were determined using genome-wide statistics. PRICE, in contrast, implements a statistical model of the Ribo-seq experiments and infers actively translated codons using maximum likelihood for all reads. Based on the inferred codons, ORF candidates are screened for signatures of active translation using hypothesis tests based on the generalized binomial distribution (which enables to handle complex cases, e.g. of overlapping ORFs).

2. PRICE is full fledged pipeline for Ribo-seq data

PRICE is a software pipeline including all steps necessary to identify and score codons and ORFs starting from one or more fastq files from Ribo-seq experiments. The pipeline includes modules to preprocess (adapter clipping, filtering) and map (Bowtie against transcriptome and genome) sequencing reads, to rescue multimapping reads, to apply the PRICE method to multiple data sets at once and to visualize results (inferred codons and ORFs) using a high-performance genome browser tailored for Ribo-seq data and subsequent integrative analyses.

Overview

To use PRICE, you must do the following

  1. Download and install it (once in your lifetime)
  2. Prepare a genome (once per genome)
  3. Run it (once per data set)

Download and installation

Make sure you have the following packages installed on your system:

  • Java 1.8
  • BASH to run the different PRICE programs
  • R to generate plots (including the packages ggplot2, reshape2 and data.table)

If you only want to run PRICE on some BAM file you already prepared, without generating plots, having Java 1.8 installed is sufficient.

PRICE is written in Java and is therefore generally platform independent. We recommend a Linux system, where PRICE has been extensively tested, and, especially when you intent to analyse several experiments, a cluster environment.

Follow these steps to prepare your system for running PRICE:

  1. Download PRICE by choosing the current version from the releases tab and unpack it into any folder (e.g. /home/erhard/Programs/PRICE_1.0.1)

  2. Make sure that you have the correct directory structure: You should have a directory containing the gedi script, the file gedi.jar as well as the subdirectory lib containing several jar files

  3. Make sure the gedi script is executable:

    • Always call it with its full path:
    /home/erhard/Programs/PRICE_1.0.1/gedi -e Price -h
    export PATH=$PATH:/home/erhard/Programs/PRICE_1.0.1
    gedi -e Price -h
    • Use the graphical configuration checker tool: Double-click the gedi.jar file (if double-clicking does not work, do java -jar gedi.jar on the command line instead). And use the "Extend PATH variable in .bashrc" option##

Caution

You need JavaFX for that third option, which is not included in openJDK used as default Java distribution in many Linux systems; either install the Oracle Java package, or alternatively install the openjfx package. Also has an option to include the PATH setting in your configuration automatically.

Prepare a genome

Assuming you are able to call the gedi script:

gedi -e IndexGenome -organism homo_sapiens -version 75 -p -nobowtie -nostar -nokallisto

This will download the genomic sequence (fasta file) as well as the corresponding annotation (gtf file) from ensembl and do some steps to prepare it for usage in PRICE. Its name for referencing it will be homo_sapiens.75. For details about preparing genomes, see below

Run PRICE

Assuming you are able to call the gedi script and have mapped reads from a Ribo-seq experiment (this is about the example data from the HSV data set, chromosome 8+):

gedi -e Price -reads hsv_8.cit -genomic homo_sapiens.75 -prefix price/hsv -progress -plot

Always make sure, that the given genome is the same as used to map reads (which is ensembl v75 for the example data). For details, see below

Inspect results

  1. Start the genome browser

    gedi -e RiboView -a price -g homo_sapiens.75 -l RAB2A
  2. Generate bedgraph files of the frame specific coverages and a bed file containing the ORFs to view in alternative genome browsers

    gedi -e ViewCIT -m bed -name 'd.getType()'  price/hsv.orfs.cit > hsv.bed
    gedi -e ViewCIT -m bedgraph -filter 'ref.isPlus() && d.sum()>=1' -score 'd.sum()' price/hsv.codons.cit > hsv.+.bedgraph
    gedi -e ViewCIT -m bedgraph -filter 'ref.isMinus() && d.sum()>=1' -score 'd.sum()' price/hsv.codons.cit > hsv.-.bedgraph
    

The ViewCIT utility is quite powerful and described in more detail below. For instance, if you are only interested in a single gene, you can filter for it using the additional parameter -q '8+:61429415-61536186' (e.g. to get the RAB2A gene).

Importantly, bedgraph (and other file formats for numeric data on genomes such as wig or bigwig) are not strand specific, so separate files must be generated for the plus and minus strands. After that, you may filter these files for a specific frame and convert them to bigwig.

Using the output

The main output table of PRICE is prefix.orfs.tsv. It has the following columns:

Name Content
Gene Gene id
Id ORF id made of the transcript id where it has been identified, the inferred ORF type and an increasing number within the transcript
Location Location of the ORF from start to stop codon; Coordinates are zero based, start is inclusive, end not inclusive; introns are separated by a pipe
Candidate Location Location of the ORF candidate before start codon prediction
Codon Start codon of the ORF
Type Inferred ORF type
Start Start codon score from the logistic regression to find start codons (the larger the better)
Range Range score from the logistic regression to find start codons (the larger the better)
p value P value for the generalized binomial test (not multiple testing corrected)
Condition-name Estimated, non-normalized total number of reads from this ORF
Total Total number of reads from this ORF

The ORF types are as follows:

Name Meaning
CDS ORF is exactly as in the annotation
Ext ORF contains a CDS, ending at its stop codon
Trunc ORF is contained in a CDS, ending at its stop codon
Variant ORF ends at a CDS stop codon, but is neither Ext nor Trunc
uoORF ORF starts in 5'-UTR, ends within a CDS
uORF ORF starts and ends in 5'-UTR
iORF ORF is contained within a CDS
dORF ORF ends in 3'-UTR
ncRNA ORF is located on non-coding transcript
intronic ORF is located in an intron
orphan Everything else

PRICE also checks that the CDS is valid (as in Ensembl there are a number of truncated CDS, e.g. with a length not divisible by 3). If an ORF is assigned to such a transcript, it may be classified an orphan, if PRICE cannot find a better matching type.

Installing and configuring PRICE

The basic way to install PRICE is to download the latest release (from the releases tab above) and unpack it to a folder of your choice. There are, however, more advanced options to install it.

To run PRICE, you need a few additional software packages installed on your system. To facilitate the painful task of resolving all these dependencies, Gedi comes with a tool that checks all required software packages and gives hints how to install them. This tool can be started by double-clicking the gedi.jar file or, equivalently, by calling

java -jar gedi.jar

on the command line.

Specifically, the required software packages are

  • Java 1.8
  • JavaFX (included in the Oracle Java package, alternatively install the openjfx package)
  • bash to run the different PRICE programs
  • R to generate plots (including the packages ggplot2, reshape2, Rserve, ROCR and data.table)

There also is optional software for specific functionality:

  • a cluster environment for large-scale processing (e.g. based on slurm or sge)
  • kraken for read preprocessing
  • bowtie for read mapping
  • the NCBI SRA-toolbox for accessing data sets from SRA

Preparing a reference

See here

Using PRICE

The PRICE program executes several steps that are necessary to analyze Ribo-seq data. It has several parameters (three of which are mandatory), and produces several output files.

Its basic usage is

gedi -e Price -reads <mapped-reads> -genomic <prepared-genomes> -prefix <prefix>

Parameters

PRICE has three mandatory parameters

  • -reads: Can either be a single BAM file containing mapped reads (make sure that you mapped reads in end-to-end mode, otherwise the information of 5'-end mismatches is lost; also, PRICE needs the MD attribute to identify mismatches, so make sure it is present in the BAM file!), a *.bamlist file containing names of BAM files (one per line), or a CIT file (e.g. as produced by the pipeline).
  • -genomic: one or several (separated by space) names of prepared references; make sure that these are the same as used to map the reads!
  • -prefix: A user chosen prefix for all output files (e.g. -prefix price/hsv)

In addition, several options are available:

  • -progress: Show the progress interactively on the command line
  • -plot: Generate several plots (PRICE model, signal-to-noise ratio plot)
  • -nthreads <n>: Set the number of parallel threads to run PRICE (default: use all cores)
  • -delta <d>: Set a regularization parameter (default is no regularization)
  • -inferDelta: Infer the regularization parameter such that that a 10% off-frame codon is recognized as such
  • -novelTranscripts: Use mapped reads to infer novel transcripts (important for not so well annotated organisms, or when only CDS are annotated)
  • -fdr <q>: set the desired false discovery rate (default: 10%)
  • -percond: Infer the codon model for each processed sample individually (instead of one common model for all samples of the experiment)

Output files

PRICE generates several output files:

  • ${prefix}.param: A list of all parameters used to call PRICE
  • ${prefix}.orfs.tsv: A table (e.g. to import into Excel) containing all the information about all called ORFs (before filtering!)
  • ${prefix}.orfs.cit: A CIT file containing the ORFs filtered by FDR (default: 10%) (to extract ORFs quickly using ViewCIT and used by the viewer)
  • ${prefix}.orfs.cit.metadata.json: The associated metadata of the CIT file (condition names, total read numbers, etc.)
  • ${prefix}.estimateData: table containing the minimal sufficient statistics for estimating the PRICE model
  • ${prefix}.model: the PRICE model in binary format
  • ${prefix}.codons.cit: A CIT file containing all called codons
  • ${prefix}.*.rmq: Several index files to efficiently show codon activities in the viewer.

If the -plot option has been used, additional files are generated:

  • ${prefix}.signal.tsv: Table containing the data for the signal-to-noise plot
  • ${prefix}.model.tsv: Table containing model parameters
  • ${prefix}.signaltonoise.svg: The signal-to-noise plot
  • ${prefix}.model.*.svg: Visualizations of the PRICE model

Several temporary files are created during runtime (usually deleted in the end, except when the -keep option is used)

  • ${prefix}.clusters.cit: CIT file containing the chunks on the genome, that were used for all steps
  • ${prefix}.codons.bin: Binary file containing the inferred codons for all chunks
  • ${prefix}.start.model: Logistic regression models for start codon prediction
  • ${prefix}.noise.model: Noise models for the generalized binomial test
  • ${prefix}.orfs.bin: Binary list of call ORFs, prior to multiple testing correction
  • ${prefix}.tmp.orfs.cit: CIT file containing all ORFs prior to deconvolution of isoforms.

Using the pipeline

PRICE does not utilize any of the available Bioinformatics pipeline tools (Explanation). Instead, it utilizes templates to generate a bash script for a given parameter file, which simply is the pipeline of a project. The pipeline can easily be modified manually, if needed, and can be executed by anyone without additional software, greatly facilitating reproducible science.

Modes of execution

There are three ways how programs can be executed in such a pipeline and dependencies (e.g. step 2 needs the results of step 1) are resolved:

  1. Serially, i.e. all commands are executed by bash in the foreground (no need to resolve any dependencies).
  2. In parallel, i.e. commands that do not depend on each other are executed in the background (using &), and dependencies are resolved using the bash command wait
  3. On a cluster using a grid engine such as sge or slurm. These bring their own mechanisms to resolve dependencies.

Parameters

Parameters are specified using JSON, preferably using a configuration file:

{
"datasets": [
		{"gsm": "GSM1020234", "name": "hcmv.5hpi.cyc"},
		{"gsm": "GSM1020235", "name": "hcmv.24hpi.cyc"},
		{"gsm": "GSM1020236", "name": "hcmv.72hpi.cyc.A"},
		{"gsm": "GSM1020237", "name": "hcmv.72hpi.cyc.B"},
		{"gsm": "GSM1020238", "name": "hcmv.5hpi.harr"},
		{"gsm": "GSM1020239", "name": "hcmv.72hpi.harr"},
		{"gsm": "GSM1020240", "name": "hcmv.5hpi.ltm"},
		{"gsm": "GSM1020241", "name": "hcmv.72hpi.ltm"},
		{"gsm": "GSM1020242", "name": "hcmv.5hpi.non"},
		{"gsm": "GSM1020243", "name": "hcmv.72hpi.non"}
	],
"references": {"h.ens86": "both", "h.rRNA": "rRNA","NC_006273": "both"},
"adapter":  "xxxxxxxxxxxxxxxxxxxx"
}

Dataset entries are either of type fastq (referring to a fastq file), sra (an SRA record id: SRRnnnnn) or geo (a GEO sample id: GSMnnnnn). References refer to prepared references and define the sequences to map reads to (Genomic: the genome only; Transcriptomic: the transcriptome only; Both: Genome and Transcriptome; rRNA: Use this prepared reference as a filter and discard all mapping reads). The adapter entry defines the 3' sequencing adapter (which can be omitted and will then be inferred by the minion tool of the kraken suite).

Templates

Templates are JHP files that can be processed into bash scripts or similar. Usually, a template requires certain variables to be set, and these variables correspond to the top-level keys in the JSON file (e.g. dataset or references). For PRICE, three templates are of interest:

  • shortread_mapping.sh: It does everything necessary to produce mapped reads from the datasets variable. For instance, for gsm entries, corresponding srr ids are identified using the eutils REST API from NCBI, downloaded using fastq-dump, adapter sequences identified using minion from the kraken suite, if not specified for the dataset (e.g. {"geo":"GSMXXXXXX","name":"Test","adapter":"TCAGGGAGACTAC"} or globally (i.e. a top-level "adapter":"TCAGGGAGACTAC"). Adapters are trimmed using reaper from the kraken suite. Then reads are mapped against all rRNA references and mapping reads are discarded. Remaining reads are mapped against all other references, and merged (i.e. for each read, all mappings are transformed to genomic coordinates, and only the ones with the least mismatches are kept). Finally, all mapping files are integrated into a single CIT file containing all mapped reads of all experiments from a project.
  • resolve_ambi.sh: For a given CIT file containing read mappings, multimapping reads are resolved (see the PRICE paper for a more detailed description)
  • report.sh: Computes read mapping statistics important for sequencing experiments.
  • price.sh: Infers the parameters of the PRICE model, computes codon activities, calibrates several error models and infers and scores ORFs.

Putting it together

The pipeline can be created using a command line tool:

gedi -e Pipeline -r cluster(lrz.slurm) -j hek293.json shortread_mapping.sh resolve_ambi.sh report.sh price.sh
  • The -r parameter determines the mode of execution (serial/parallel/cluster(<clustername>); the clustername must be configured in the configuration tool.
  • -j specifies a json parameter file for the given template files. Additional top-level keys can be directly specified, e.g. --tmp=/mnt/ssd/tmp, and more than one -j parameter can be present. Parameters are replaced from left to right.
  • Finally, at least one template must be given (either a path to a template, or the name of one of the internal templates)

Important to know is that each template defines input and output variables. For instance, the template price.sh needs the variable reads to be set to a file path containing read mappings. Thus, if the pipeline tool is called without short_bowtie and resolve_ambi.sh, the variable reads has to be set either in the json file or directly via --reads=xyz. The template shortread_mapping.sh sets this variable, making the command above a valid example. The input and output variables are printed with the parameter -h <template ...>. For instance, for the whole pipeline:

Input:
 tmp             Temp directory
 datasets        Dataset definitions
 references      Definitions of reference sequences
 test            Test with the first 10k sequences [optional]
 keeptrimmed     Keep the trimmed fastq files [optional]
 reads           Filename containing read mappings
 startcodon      Treatment pairs for start codon prediction [optional]

The full usage of the pipeline tool is

Pipeline [options] <template> [<template>...]

Takes one or more templates (either file names or labelled names from the resources), and executes them, thereby replacing values from json-files or given by --var=val.

        Options:
        -n [name]               Specify name (Default: Name of the last -j parameter)
        -o [file]               Specify output file (Default: $name/scripts/start.bash)
        -wd [folder]            Specify working directory (Default: `pwd`/$name)
        -log [folder]           Specify log folder (Default $wd/log)
        -r [runner]             Specify runner method (serial/parallel/cluster(<clustername>)); see gedi -e Cluster for details about clusternames)
        --x.y[3].z=val          Specify other runtime variable; val can be json (e.g. --x='{\"prop\":\"val\"}'
        -j [json-file]          Specify other runtime variables
        -h [<template>...]      Print usage information (of this program and the given templates)

Examples

gedi -e Pipeline -r cluster(lrz.slurm) -j hek293.json shortread_mapping.sh resolve_ambi.sh report.sh price.sh

Create the full pipeline for the hek293 project to run on the LRZ Linux cluster

gedi -e Pipeline -r cluster(lrz.slurm) -j hek293.json shortread_mapping.sh price.sh

The same, but without rescuing multimapping reads and creating the reports.

gedi -e Pipeline -r serial -j hek293.json --tmp=/mnt/ssd/tmp --reads=hek293/mapped.cit price.sh

Only execute the PRICE method on the mapped reads in mapped.cit. Do not use a cluster but execute all steps locally one after the other. Here, the variable reads that is expected by the template price.sh must be set via a parameter to gedi -e Pipeline, as it is not present in the json file (but set by the shortread_mapping.sh template in the example above). Also, the variable tmp is set, i.e. temp files are written to the SSD mounted at /mnt/ssd

Visualizing results

The genome browser implemented in PRICE can be called once all three programs ran through:

gedi -e RiboView -a price -g homo_ens86

Its basic functions are:

  • Zoom by mouse wheel
  • Pan by holding the left mouse button and drag
  • Enter a location or gene name in the address bar
  • Press F5 to make screenshots
  • Double-click onto a transcript or ORF to zoom in on it (shrinking all introns)
  • Hide and show tracks by hitting the Tracks button
  • Drawing manual annotations in the editor track (Double-click: rename, Drag: move it, Drag an edge: resize it, Right-click: Save or load annotations)

The full usage is:

gedi -e RiboView <Options>

Options:
 -l <location>                  Location string (in the same format as in the address line in the browser)
 -a <folder ...>                Load all in the given folders
 -f <prefix1 prefix2 ...>       Prefixes for input files
 -t <file>                      File containing total counts (from RiboStatistics)
 -g <genome1 genome2 ...>       Genome names
 -tracks <file1 file2...>       Bed or location files to form additional tracks

 -D                     Output debugging information
 -h                     Show this message

Additional tools in Gedi

Gedi contains additional programs that are relevant for PRICE users.

Pipeline statistics

This is also called by the pipeline and produces several tables containing statistics alongside with corresponding plots.

gedi -e Stats -g homo_sapiens.75 -prefix report/hsv. hsv_8.cit

Its full usage is:

gedi -e Stats <Options> <storage-file>

Options:
 -l <location>                  Location string
 -g <genome1 genome2 ...>                       Genome names
 -downsampling <method>                 Set downsampling method (one of Max,Logsc,No,Digital, default is Max)
 -mode <mode>                   Set read count mode (one of All,Weight,Divide,Unique,CollapseAll,CollapseUnique, default is Weight)
 -interactive                   Display all plots progressively in a graphical window
 -prefix <prefix>                       Prefix for output files (default: report/<filename>.)
 -out [<oml>]           Write current pipeline to oml file (and do not process storage)

 -h [<template>]        Print usage information (of this program and the given template)
 --x.y[3].z=val         Define a template variable; val can be json (e.g. --x='{\"prop\":\"val\"}'
 -j <json-file>         Define template variables
 -t <template-file>             Process a template (default: reads for a storage containing reads, default otherwise)

 -oml [<oml>]           Process using the given oml file

 -nthreads <int>                        Run with given number of threads (default: min(8 or number of available threads))
 -p                     Show progress
 -D                     Output debugging information
 -h                     Show this message

Storage file can be a bam file or a cit file containing reads.

FAQ

I have several samples (replicates, conditions); how do I use the pipeline?

The pipeline works as follows:

  1. Read mapping: Reads from each sample (specified in the datasets property of the json) are mapped individually against the specified references (i.e. we will have one mapping file per datasets entry).
  2. Merging: Mapping files are merged into a single file (preserving the information which read mapping came from which sample, i.e. this is not pooling the samples)
  3. Reporting: When report.sh is used, several statistics are written into the report directory along with index.html, which can be loaded in any webbrowser to look at the reports
  4. PRICE: PRICE reads the merged mappings and computes its codon model. Either a single model is estimated for all samples, or, when using the -percond parameter (default when using the pipeline), for each sample a separate model.
  5. PRICE: Using the codon model(s), reads are mapped to P site codons, these are assembled into ORFs and scored using the generalized binomial test. The .orfs.tsv contains all ORFs (unfiltered) along with the number of reads for each sample.

Thus, if you have several conditions with several replicates each, you should create a single json file and process them together. To get the total number of reads for each ORF in each of the samples, look at the .orfs.tsv file.

What are the modes of execution (-r) of the pipeline?

The pipeline consists of several jobs (e.g. for each sample one mapping job, a merge job, a report job, a PRICE job). the mode of execution determines how jobs are executed. Currently, there are three modes of execution: serial, parallel and cluster. In serial execution, all jobs are executed one after another. In parallel execution, jobs are executed in parallel, if possible (by starting them as background processes and then use bash's wait command). In cluster mode, a grid management system (such as SLURM) is used. Importantly, in all cases the pipeline takes care of dependencies (e.g. merging the mappings can only be started once all mapping jobs are finished.

What are the pitfalls of using PRICE on existing BAM files?

First of all, reads must be mapped to the genome using a splice aware read mapper. Mapping against the transcriptome (i.e. all mRNA sequences) is not supported. The genome assembly must be the same as supplied to PRICE. BAM files must be sorted according to position (samtools sort xyz.bam) and indexed (samtools index xyz.bam).

In addition, there are two important things to consider:

  1. Reads have to be mapped in end-to-end mode (which is e.g. not the default for STAR!); otherwise, any untemplated nucleotide addition will be soft-clipped (instead of appearing as a mismatch). PRICE utilized the information of untemplated additions in building its model and inferring codons!
  2. The following attributes have to be present in the BAM file: MD (To determine the mismatch types), NH (To determine mapping ambiguities)

If you want to use STAR, you have to use (at least) the following parameters: --outSAMattributes MD NH --alignEndsType EndToEnd

What if my organisms does not have a proper annotation (i.e. full length mRNAs, like yeast or viruses)?

Use the -novelTranscripts parameter!

You can’t perform that action at this time.