# Part 1.1: Virtual environments and installation of packages hosted on GitHub

## Sections:
   - 1.1.1 Short introduction to Python virtual environments: how a virtual environment works, installing and using a virtual environment.
   - 1.1.2 Installation of packages hosted on GitHub.
   - 1.1.3 Short introduction to Jupyter (Notebook) and the JupyterHub
   
## Questions & Objectives:
   - Why do we need a virtual environment?
   - Learn how to install and use a Python virtual environment.
   - Learn how to install a package hosted on GitHub, in particular how to install the rpbp package.
   - Jupyter Notebook basics: dashboard, user interface, navigation, running code, etc. How to use Jupyter with virtual environments.
   - How to use the JupyterHub.
   
### Afterwards, I will be able to:
   - create and use a Python virtual environment to install packages.
   - understand what a Jupyter Notebook is, and how to use it.

## 1.1.1 Short introduction to Python virtual environments

A *virtual environment* is a tool to help you keep dependencies and packages required by different projects separate and isolated from the system-wide installation.

### Why do we need a virtual environment? 
***
    
Imagine a scenario where you need different versions of the same package, say v1.8 and v2.5. These would normally reside in the same directory with the same name...

    Virtual environments make it possible to isolate different versions of the same package (if they live in different environments, say myproject1.8 and myproject2.5).
    
Imagine a scenario where things go wrong (you break something, install many conflicting packages) ... 

    Virtual environments can be deleted and re-created easily without affecting your system or the other packages that are installed outside.
    
Imagine a scenario where you cannot install packages system-wide or do not want to do it...

    Virtual environments enable you to install everything that you want, without affecting the system or other packages that are installed outside.
    
    
### How does a virtual environment work?
***

We use a module named `venv` which is a tool to create isolated Python environments. A virtual environment is a directory tree (a folder structure) which contains Python executables and other necessary files. These are ideally isolated from system site directories. Each virtual environment has its own Python installation and can have its own independent set of installed Python packages.
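
A quick way to see this isolation from within Python is to compare `sys.prefix` with `sys.base_prefix`. The short sketch below assumes an environment created with `venv`; the function name `in_virtual_environment` is only illustrative.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points to the environment directory, while
    # sys.base_prefix still points to the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```

Run with the system Python, the two prefixes are identical. Run with the Python of an activated environment, they differ.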

When working in a command shell, to create a virtual environment (assuming you have Python3 installed!):

`
python3 -m venv /path/to/new/virtual/environment
`
 
To activate the new virtual environment:

`
source /path/to/virtual/environment/bin/activate
`
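
The same directory tree can also be created from Python with the standard-library `venv` module. This is a minimal sketch: the temporary location and the name `demo-env` are placeholders, and `with_pip=False` is used only to keep the example fast.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment at env_dir and return the path to its config file."""
    # with_pip=False skips installing pip into the new environment
    venv.create(env_dir, with_pip=False)
    return Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo-env")
    # pyvenv.cfg records, among other things, which Python the environment is based on
    print(cfg.read_text())
```

Looking at `pyvenv.cfg` and at the `bin` directory of a freshly created environment is a good way to convince yourself that an environment is "just a folder".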

### Let's try it!
***

We will create a fresh virtual environment and eventually install packages using the script `ribo-setup`, located
under *hbigs_course_2019/ribosome-profiling*.


### More
***

For information about Python virtual environments, see the [venv](https://docs.python.org/3/library/venv.html#module-venv) documentation. See also [PEP 405](https://www.python.org/dev/peps/pep-0405/).


## 1.1.2 Installation of packages hosted on GitHub

### What is GitHub?
***

Git is an open-source version control system, *i.e.* it is used to track changes to documents/code for yourself or amongst collaborators, to release versions, *etc.* GitHub.com is a hosting platform where (mostly) developers store their Git projects and make them accessible to the community. Anyone (even people who have nothing to do with the development of a project) can download the files and use them according to the license (*e.g.* the `rpbp` package, as well as these notebooks/scripts and the original material for this part of the course, are under the MIT license). Project files are stored in a particular location, referred to as a repository (usually abbreviated to *repo*), which you can access with a unique URL.

### How to install a package hosted on GitHub?
***

Some packages hosted on GitHub can be installed via package-management systems (`pip` for Python, RStudio Package Manager, *etc.*). In these cases, you may not even know that the source code for these packages "lives" on GitHub. Sometimes, however, you need to install the package by first *cloning* the repository and following specific instructions.

![](img/git-1.png)
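
For packages that ship a standard Python build configuration, `pip` can also install directly from a Git URL, without a separate cloning step. The sketch below only builds the command; the owner and repository names are hypothetical, and whether a given package supports this route depends on its own installation instructions.

```python
import sys


def pip_install_from_github(owner, repo, ref=None):
    """Build the pip command that installs a package straight from a GitHub repository."""
    url = f"git+https://github.com/{owner}/{repo}.git"
    if ref is not None:
        url += f"@{ref}"  # pin a branch, tag or commit
    # Using "sys.executable -m pip" installs into the environment of the running Python
    return [sys.executable, "-m", "pip", "install", url]


# Hypothetical owner/repository names, for illustration only.
cmd = pip_install_from_github("some-lab", "some-package", ref="v1.0")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Calling `pip` through `sys.executable` matters when working with virtual environments, because it guarantees that the package lands in the environment you are currently using.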


### Let's try it!
***

After creating a new virtual environment for this part of the course, we will install the `rpbp` and `slurm-magic` packages using the script `ribo-setup`, located under *hbigs_course_2019/ribosome-profiling*.


### More
***

In the last part of the course, we will go through a collection of explanations and short practical exercises about version control (Git and GitHub) and open source software. 


## 1.1.3 Short introduction to Jupyter (Notebook) and the JupyterHub

The Jupyter Notebook is an open-source web-based application for in-browser editing of notebook documents, which combine text, code, computations and rich media output. By default, Jupyter runs on your own computer, *i.e.* your computer acts as the server. Jupyter supports many programming languages, including Python, R, and Julia. The JupyterHub is a multi-user version of the Notebook (this notebook is served via "our Hub" `https://jupyter.dieterichlab.org:49200`).

### Basic workflow
***

Typically, a notebook document is organised into cells, and one moves forward from one cell to the next, breaking the content or the computation into separate parts. This workflow allows you to "validate" the output of one cell before moving to the next, and is also convenient for interactive exploration.

### Jupyter Notebook basics
***

The Notebook dashboard: when you first start the notebook server, your browser will open to the notebook dashboard. To create a new notebook, click on the "New" button and select a kernel from the dropdown menu. Running notebooks are shown with a green icon and text (they are also listed under the "Running" tab). Notebooks remain running until you explicitly shut them down; closing the notebook's page is not sufficient.
   
![](img/jupyter-1.png)
    
To shut down, delete, duplicate, or rename a notebook, check the checkbox next to it. You can also perform these operations and more directly on the running notebook by using the top menu and tool bar (see at the top of this notebook!).

![](img/jupyter-2.png)

Modal editor: **edit mode**. Edit mode is indicated by a green cell border and a prompt showing in the editor area. When a cell is in edit mode, you can type into the cell, like a normal text editor. Enter edit mode by pressing `Enter` or using the mouse to click on a cell.

![](img/jupyter-3.png)

Modal editor: **command mode**. Command mode is indicated by a grey cell border with a blue left margin. When you are in command mode, you are able to edit the notebook as a whole, but not type into individual cells (be careful not to type into a cell while in command mode, as keystrokes are interpreted as commands!). Enter command mode by pressing `Esc` or using the mouse to click outside a cell.
    
![](img/jupyter-4.png)

### Basic commands
***

Edit mode

| Command | action |
|------|------|
|   `Ctrl-Enter`  | run selected cells |
|   `Shift-Enter`  | run cell, select below |

Command mode

| Command | action |
|------|------|
|   `Enter`  | enter edit mode |
|   `Esc`  | enter command mode |
|   `a`  | insert cell above |
|   `b`  | insert cell below |


## More
***

- See [Project Jupyter](https://jupyter.org/) for installation instructions, detailed documentation, *etc.*
- After the course, explore the menu (Help) of this notebook, and experiment with basic commands.


## How to use Jupyter with virtual environments

We will now go back to your Desktop, open an `LXTerminal` and navigate to the directory for this part of the course by typing: 

`
cd ~/hbigs_course_2019/ribosome-profiling
`

There you will see this notebook and others, as well as a file called `ribo-setup`. We will run this script to create a new virtual environment, clone the `rpbp` and `slurm-magic` repositories, install the packages and add a Jupyter kernel for the newly created environment. To run the script from the terminal, type:

`
chmod +x ribo-setup; ./ribo-setup
`

When the script has finished running, we will need to (1) refresh the page, (2) go to "Kernel" in the top menu bar, (3) "Change kernel" in the dropdown menu list, and then select `hbigs19-ribo`, which is the name of our newly created environment.

![](img/jupyter-5.png)
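
What ties a kernel to a virtual environment is a small `kernel.json` file whose `argv` entry points to the Python executable of that environment. The sketch below builds such a specification as a dictionary; the path is a placeholder, and the exact way `ribo-setup` registers the kernel is not shown here (such a file is typically generated with `python -m ipykernel install --user --name <name>`).

```python
import json


def kernel_spec(python_path, display_name):
    """Return the content of a kernel.json file for a Python (ipykernel) kernel."""
    return {
        # Jupyter replaces {connection_file} when it launches the kernel
        "argv": [python_path, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }


# Placeholder path: the Python executable inside the environment.
spec = kernel_spec("/path/to/envs/hbigs19-ribo/bin/python", "hbigs19-ribo")
print(json.dumps(spec, indent=2))
```

Selecting this kernel in the "Kernel" menu therefore runs the notebook code with the environment's Python and its installed packages, even though the notebook server itself runs elsewhere.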


We are now ready to go...


# Part 1.2: Introduction to ribosome-profiling (Ribo-seq) and the Rp-Bp workflow

For practical reasons, we will cover some technical aspects before introducing ribosome profiling in more detail. In particular, we will first set up our notebook to run `rpbp` on the example Ribo-seq dataset, briefly introduce the `Slurm` workload manager (via `slurm-magic`), and actually run the `rpbp` pipeline. While our job is running, we will then go into more detail about Ribo-seq, the `rpbp` package and some methodological aspects behind it.

## Sections:

   - 1.2.1 Overview of the `rpbp` pipeline: command line options and configuration file
   - 1.2.2 Very short introduction to the `Slurm` workload manager (`slurm-magic`), used to run the rpbp pipeline
   
   
   - 1.2.3 High-level introduction to Ribo-seq, de novo ORF discovery (elements of annotation, transcript isoforms, CDS, UTRs, etc.), biological relevance of alternative translation events (including translation from non-coding transcripts), and why we need "dedicated software" to analyse Ribo-seq data.
   - 1.2.4 The `rpbp` pipeline step-by-step:
         - Creating reference genome indices;
         - Running the pipeline: creating ORF profiles, predicting translated ORFs.

## Questions & Objectives:
   - What is the translatome? What are the uses of Ribo-seq?
   - Why do we need dedicated software to analyse Ribo-seq data?
   - What software is available to analyse Ribo-seq data?
   - Understand how to use the `rpbp` package (on my laptop, on the cluster using the `Slurm` workload manager).
   - Run the complete rpbp pipeline on a selected Ribo-seq dataset.
   
### Afterwards, I will be able to:
   - understand how to analyse Ribo-seq data for ORF discovery;
   - run the rpbp package (only ORF profiles, or full pipeline).

## 1.2.1 Overview of the rpbp pipeline: command line options and configuration file

The `rpbp` pipeline consists of an index creation step, which must be performed once for each genome and set of annotations, and a two-phase prediction pipeline, which must be performed for each sample. In the first phase of the prediction pipeline, ORF profiles are created. In the second phase, the ORFs which show evidence of translation are identified.

### Creating reference genome indices
***

The entire index creation process can be run automatically using the following command:

`
prepare-rpbp-genome <config> [--overwrite] [logging options] [processing options]
`

See [Creating reference genome indices](https://rp-bp.readthedocs.io/en/latest/usage-instructions.html#creating-reference-genome-indices) for detailed information. To save time, we have already created the indices for the human genome (GRCh38.96). Go to your `LXTerminal` and navigate to these files:

`
cd ~/hbigs_course_2019/ribosome-profiling/genomes
`

We will briefly explain the index creation step by examining these files.


### Running the pipeline
***

The entire `rpbp` pipeline (2 steps) can be run on a set of Ribo-seq samples, including any biological replicates. To run the pipeline, we first need to prepare a configuration file, consisting of a series of required (and optional) key: value pairs. We will explain this below.

Lastly, to run the pipeline on the cluster, we will use the `Slurm` workload manager. `Slurm` is a job scheduler. For our purpose, it suffices to know that it provides a framework to ask for resources and execute our job on the cluster. To submit our job, we will use the command `sbatch`, and to monitor the status of our job, we will use `squeue -u username`. We will actually use `slurm-magic` commands, which implement special commands to interact with `Slurm`.




***
<font color=red>**Note** The cells below contain "code", so we will need to run them one after the other.</font> 

In [None]:
# import modules that are needed to run this notebook

import os
import sys
import pandas as pd
import numpy as np

%load_ext slurm_magic


In [25]:
# some functions, definitions, etc. that are needed to run this notebook

from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writefile_globals(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))
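
The `writefile_globals` magic relies on Python's `str.format`: every `{NAME}` placeholder in the cell is replaced by the value of the variable `NAME`. The sketch below shows the same mechanism on a plain string; the value given to `RES` is a placeholder.

```python
def render(template, variables):
    """Fill {NAME} placeholders in template with the matching entries of variables."""
    return template.format(**variables)


template = "riboseq_data: {RES}\nnote: literal braces are written {{like this}}\n"
values = {"RES": "/path/to/riboseq-results"}  # placeholder value
rendered = render(template, values)
print(rendered)
```

One consequence of this design is that any literal curly brace in the cell must be doubled, otherwise `format` will try to interpret it as a placeholder.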

In [2]:
# $PATH is an environment variable that lists the directories in which the system looks for programs.
# When one types a command to run, the system looks for this command in the directories specified by $PATH.

# What is on your $PATH?
!echo $PATH 


In [3]:
# We will now add the location of certain programs that are required to our $PATH,
# including executable scripts that were installed with the rpbp package.

# Advanced note: sys.path does not contain path to virtual environment executables, and adding to sys.path
# does not solve the problem...

# first find where we are, we use the same structure as in "ribo-setup" 
HOME = !echo $HOME
HOME = HOME.n
PARENT = HOME + '/hbigs_course_2019/ribosome-profiling'
# this is the location where rpbp-related executables were installed
ENVLOC = PARENT + '/envs/hbigs19-ribo/bin'
# these are standard bioinformatics tools that you have already used, which are installed
# at these locations on the cluster, we use specific versions to be compatible with the rpbp package
add2path = ['/biosw/slurm/18.08.6.2/bin',
            '/biosw/bowtie2/2.3.0',
            '/biosw/star/2.6.1d',
            '/biosw/samtools/1.7/bin',
            '/biosw/flexbar/3.5.0']
add2path.append(ENVLOC)

PATH = !echo $PATH
PATH.extend(add2path)
PATH = ':'.join(PATH)
# our updated $PATH is...
%set_env PATH=$PATH
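
For reference, the same `$PATH` manipulation can be written in plain Python, without IPython magics. This is only a sketch: the helper `extend_path` and the directory `/opt/example/bin` are illustrative and not part of the course setup.

```python
import os
import shutil


def extend_path(current, extra_dirs):
    """Append directories to a PATH-like string, skipping those already present."""
    parts = [p for p in current.split(os.pathsep) if p]
    for d in extra_dirs:
        if d not in parts:
            parts.append(d)
    return os.pathsep.join(parts)


# Illustrative directory only.
new_path = extend_path(os.environ.get("PATH", ""), ["/opt/example/bin"])
print(new_path)
# shutil.which searches the given path like the shell does; it returns None if nothing is found.
print(shutil.which("python3", path=new_path))
```

Because the new directories are appended, programs found in directories that were already on `$PATH` keep precedence over those in the added locations.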


In [4]:
# Prepare to run the example (downsampled) dataset: 4 replicates, 2 PBS, 2 EGF.
# change directory
DIRLOC = PARENT + '/riboSeqHBIGS19-downsampled-analysis'
CFG = DIRLOC + '/config'
RES = DIRLOC + '/riboseq-results'
%cd $CFG


In [None]:
# We first need to prepare a YAML configuration file to run the rpbp package. 
# YAML ("YAML Ain't Markup Language") is a human-readable data-serialization language.

# We will do this below and explain the structure of the config file. Please check that
# the file has been correctly written to disk under /riboSeqHBIGS19-downsampled-analysis/config

In [26]:
%%writefile_globals hbigs19-downsampled.yaml

project_name: HBIGS19-downsampled

# Base location for the created index files.
genome_base_path: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96
    
# An identifier which will be used in the filenames. This should not contain
# spaces, forward slashes, or other special characters.
genome_name: GRCh38.96
    
# The full path to the GTF file which contains the exon and CDS annotations.
gtf: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96/GRCh38.96.gtf

# The fasta file which contains the genome. The genomic identifiers in the GTF and
# fasta files must match (e.g., "I" and "I", or "chrI" and "chrI", but not "I" and "chrI").
fasta: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96/GRCh38_96.fa

# The base location for the STAR genome index.
star_index: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/GRCh38_96/star

# The base location for the Bowtie2 index for the ribosomal sequences.
ribosomal_index: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/rRNA_cluster_plus_mtRNA/rRNA_cluster_plus_mtRNA
# The fasta file containing the rRNA sequences. The file can also contain other
# sequences which should be filtered, such as tRNA or snoRNAs
ribosomal_fasta: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/homo_sapiens/rRNA_cluster_plus_mtRNA/rRNA_cluster_plus_mtRNA.fasta

# A file containing standard adapters.
adapter_file: /beegfs/pub/hbigs_course_2019/ribosome-profiling/genomes/riboseq-adapters.fa

# The base location for the created files
riboseq_data: {RES}

# A dictionary in which each entry specifies a sample. The key is an 
# informative name about the sample, and the value gives the complete path to 
# the sequencing file (a fastq(.gz) file). The names will be used to 
# construct filenames, so they should not contain spaces, forward slashes, or 
# other special characters.
riboseq_samples:
 dSRR7451194.EGF.rep-1: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451194_1.fastq.gz
 dSRR7451184.EGF.rep-2: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451184_1.fastq.gz
 dSRR7451191.PBS.rep-1: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451191_1.fastq.gz
 dSRR7451197.PBS.rep-2: /beegfs/pub/hbigs_course_2019/raw-data/downsampled/dSRR7451197_1.fastq.gz

riboseq_biological_replicates:
 EGF:
  - dSRR7451194.EGF.rep-1
  - dSRR7451184.EGF.rep-2
 PBS:
  - dSRR7451191.PBS.rep-1
  - dSRR7451197.PBS.rep-2

riboseq_sample_name_map:
 dSRR7451194.EGF.rep-1: EGF1
 dSRR7451184.EGF.rep-2: EGF2
 dSRR7451191.PBS.rep-1: PBS1
 dSRR7451197.PBS.rep-2: PBS2

# Rp-Bp options: we need to change the default parameters to run the downsampled data.
# Generally, you do not need to change the default parameters!

# The number of bases upstream of the translation initiation site to begin 
# constructing the metagene profile.
metagene_start_upstream: 50
# The number of bases downstream of the translation initiation site to end 
# the metagene profile.
metagene_start_downstream: 50
# The number of bases upstream of the translation termination site to begin 
# constructing the metagene profile.
metagene_end_upstream: 50
# The number of bases downstream of the translation termination site to end 
# the metagene profile.
metagene_end_downstream: 50

# N.B. These values are set artificially low for the example to work!
min_metagene_profile_count: 50
min_metagene_image_count: 10

# N.B. These values are set low to reduce the running time, but will affect the results.
metagene_iterations: 100
translation_iterations: 100


In [None]:
# We are now ready to submit our job. We use the Slurm workload manager.
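
Before submitting, it can be worth checking that every sample listed under `riboseq_biological_replicates` also appears under `riboseq_samples`. The sketch below works on a dictionary such as one would obtain by loading the YAML file; the function `check_replicates` is a hypothetical helper and not part of `rpbp`, and the paths are placeholders. One sample is left out on purpose to show what a mismatch looks like.

```python
def check_replicates(config):
    """Return (condition, sample) pairs listed as replicates but missing from riboseq_samples."""
    samples = set(config.get("riboseq_samples", {}))
    missing = []
    for condition, names in config.get("riboseq_biological_replicates", {}).items():
        for name in names:
            if name not in samples:
                missing.append((condition, name))
    return missing


# A cut-down configuration, written directly as a dictionary (placeholder paths).
config = {
    "riboseq_samples": {
        "dSRR7451194.EGF.rep-1": "/path/to/dSRR7451194_1.fastq.gz",
        "dSRR7451191.PBS.rep-1": "/path/to/dSRR7451191_1.fastq.gz",
    },
    "riboseq_biological_replicates": {
        "EGF": ["dSRR7451194.EGF.rep-1"],
        "PBS": ["dSRR7451191.PBS.rep-1", "dSRR7451197.PBS.rep-2"],
    },
}
print(check_replicates(config))
```

An empty list means that the two sections are consistent.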

In [5]:
%%sbatch
#!/bin/bash
#SBATCH -J "hbigs19"
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=20G

run-all-rpbp-instances hbigs19-downsampled.yaml \
    --merge-replicates \
    --run-replicates \
    --keep-intermediate-files \
    --num-cpus 12 \
    --mem 120G \
    --use-slurm \
    --logging-level INFO \
    --log-file hbigs19-downsampled.log


In [1]:
# We check that our job is actually running...
%squeue -u course01
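
Outside the notebook, the job status can also be queried from a script. The sketch below parses the text produced by `squeue -h -o "%i %j %T"` (job id, job name, state, no header); the example output is made up, and the parser assumes that job names contain no spaces.

```python
def parse_squeue(text):
    """Parse the output of `squeue -h -o "%i %j %T"` into a list of dictionaries."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip empty or unexpected lines
        job_id, name, state = fields
        jobs.append({"id": job_id, "name": name, "state": state})
    return jobs


# Made-up output, for illustration only.
example = "1001 hbigs19 RUNNING\n1002 hbigs19 PENDING\n"
jobs = parse_squeue(example)
print([job["id"] for job in jobs if job["state"] == "RUNNING"])
```

On the cluster, the text could be obtained with `subprocess.run(["squeue", "-u", username, "-h", "-o", "%i %j %T"], capture_output=True, text=True).stdout`.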

## 1.2.3 High-level introduction to Ribo-seq

The idea behind the Ribo-seq protocol can be summarised in a few lines: ribosome-protected fragments or footprints (RNA protected by the ribosome) can be isolated through the use of nucleases that degrade unprotected RNA regions, and submitted to a deep sequencing protocol similar to those used for RNA. In brief: *(i)* ribosome-bound RNA is isolated from cell/tissue lysates; *(ii)* ribosomes are treated with a drug chosen according to the purpose of the experiment, *e.g.* to target elongating ribosomes (cycloheximide) or initiating ribosomes (harringtonine, lactimidomycin/puromycin); for eukaryotes, the choice of inhibitor has a concentration-dependent impact on the kinetics of initiation and elongation; *(iii)* nucleases (RNase) are added to digest the unprotected RNA; the choice of nuclease and treatment strongly affects the ribosome profiles; *(iv)* after footprint recovery, rRNA depletion is performed and samples are sequenced. Protocols vary a lot (with or without a circularisation step, bias reduction methods, *etc.*), and new methods keep appearing ([Ligation-free ribosome profiling](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1005-1), [RiboLace](https://www.sciencedirect.com/science/article/pii/S2211124718315444?via%3Dihub), to name but a few).

By mapping the position of translating ribosomes over the entire transcriptome, Ribo-seq provides a snapshot of the entire translatome. Ribo-seq has been used to answer a wide range of questions, including the identification of translated small open reading frames (ORFs), non-coding sequences and alternative reading frames, the quantification of translational control (in combination with RNA-seq, *i.e.* translational efficiency, or regulatory mechanisms associated with the translation process itself, such as upstream ORFs acting in cis, or translation regulating transcript stability by triggering NMD via recognition of a premature stop codon), or gaining mechanistic insights into the translation process itself.

### Open reading frame discovery
***

We understand an open reading frame (ORF) as a potentially translatable sequence that consists of a series of codons beginning with a start codon and ending with a stop codon. Translatable ORFs can be found anywhere: in the 5' untranslated region (5'UTR), in the 3' untranslated region (3'UTR), within or overlapping with annotated coding sequences (CDSs), in transcripts that were previously thought to be non-coding (lincRNAs, antisense, pseudogene, or other processed transcripts), or in novel transcripts (intra/intergenic).

![](img/ribo-1.png)
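
The definition above can be turned into a naive ORF scan. The sketch below is only an illustration of the concept, under simplifying assumptions: it searches the forward strand of a single unspliced sequence, uses `ATG` as the only start codon, and reports the first start codon before each in-frame stop. It is not how `rpbp` extracts ORFs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, start_codon="ATG"):
    """Return (start, end) pairs of ORFs on the forward strand, end exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # walk through the sequence codon by codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == start_codon:
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return orfs


seq = "GGATGAAACCCTAGTT"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

In this toy sequence, the only ORF is `ATGAAACCCTAG`, found in the third frame.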

As we will briefly explain below, in the `rpbp` workflow, ORFs are labeled according to their position and exon structure relative to the annotations. Except for annotated CDSs (Canonical), the assignment of these labels depends on the complexity of the annotations, and is thus not "fixed".


| Label | Description |
|------|------|
|   **Canonical**  |  An ORF that coincides with an annotated coding sequence (CDS from a protein coding transcript) |
|   **Can. (variant)**  | An ORF that is in-frame (with respect to a CDS), N-terminus extended or truncated |
|   **uORF** or five prime |  An ORF that is in the 5'UTR of a CDS and does not overlap another CDS on the same strand (from alternative transcripts) |
|   **dORF** or three prime | An ORF that is in the 3'UTR of a CDS and does not overlap another CDS on the same strand (from alternative transcripts) |
|   **ncORF** or noncoding | An ORF that originates from a transcript not annotated as coding (non-coding, processed transcript or pseudogene) |



An overlapping uORF is an out-of-frame ORF overlapping on the same transcript both the 5'UTR and the canonical coding sequence. An overlapping dORF is an out-of-frame ORF overlapping on the same transcript both the 3'UTR and the canonical coding sequence. 



#### Small open reading frames and biological relevance of alternative translation events
***

We refer to a small ORF (**sORF**) as an ORF that encodes fewer than 100 amino acids. 

Accumulated evidence from Ribo-seq (and proteomics) experiments over the last 5 years suggests that peptides encoded by small open reading frames are underpredicted, even for well-annotated model species. sORFs can be found over a wide variety of transcripts. So far, a number of sORFs encoding functional peptides have been identified, such as Sarcolamban, Myoregulin, Myomixer, Minion, MOXI, MOTS-c, SPAR, or NoBody, to name but a few.


Tools such as `rpbp` can be used as part of a *sORF discovery workflow*, as depicted below:



![](img/ribo-2.png)


### More
***

Many recent publications address the role of alternative translation events:
- [S. van Heesch et al. The Translational Landscape of the Human Heart](https://www.sciencedirect.com/science/article/pii/S0092867419305082)
- [J. Ruiz-Orera & M. Mar Albà, Translation of Small Open Reading Frames: Roles in Regulation and Evolutionary Innovation](https://www.sciencedirect.com/science/article/pii/S0168952518302221)



### Why do we need "dedicated software" to analyse Ribo-seq data?
***





******* workflow

alba

Figure 1. Workflow to Identify Actively Translated ORFs. Ribosome-protected RNA fragments are sequenced and mapped to all annotated transcripts in the species genome. On the basis of the three-nucleotide periodicity of the reads, and additional features, different software can predict actively translated ORFs. The prediction is based on a high fraction of reads in the correct frame (in-frame, blue colour) when compared to alternative ones (off-frame, red colour). This step is performed with single-nucleotide resolution after computing the read P-site per each read length. Abbreviations: ORF, open reading frame.


** then methods


However, it has also been frequently shown that the ribosome occupancy itself, as indicated by the RPF reads mapped on the transcriptome, is not sufficient for calling of the active translation, given the possible noise from the data processing and experimental procedures, regulatory RNAs that bind with the ribosome, and ribosome engagement without translation (16,17).

Owing to its subcodon resolution, ribosome profiling reveals the precise locations of the peptidyl-site (P-site) of the 80S ribosome in the RPF reads, given that the experiment itself was properly performed and the RPF reads were correctly filtered. Aligned by their P-site positions, the RPF reads resulted from the translating ribosomes should therefore exhibit 3-nt periodicity along the ORF, which is the strongest evidence of active translation. Only recently have different strategies been developed to assess the translation by testing the distribution of ribosome engagement at the subcodon resolution (11,12,18–23). These methods have been comprehensively reviewed in (24). Some of these methods used the strategy of machine learning, which requires prior annotation of the known coding transcripts for training of the model (12,21). Like many supervised methods in general, the results of these methods heavily rely on the pre-annotated training set, source of a potential intrinsic bias. On the other hand, only a couple of other methods were designed for de novo translatome annotation by directly assessing the 3-nt periodicity, and these include the strategy of ORFscore (11), RiboTaper (18), RP-BP (22) and PRICE. In the present study, we have developed a new statistically vigorous method, RiboCode, for the de novo annotation of the full translatome by quantitatively assessing the 3-nt periodicity (Figure 1).
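
The 3-nt periodicity signal mentioned above can be illustrated by counting how many P-sites fall into each of the three frames of an ORF. This is a toy sketch with made-up positions; it shows only the raw signal, not the statistical models used by `rpbp` or by the other tools listed.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start):
    """Fraction of P-sites falling in each of the three frames relative to orf_start."""
    counts = Counter((pos - orf_start) % 3 for pos in psite_positions)
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]


# Made-up P-site positions along an ORF starting at position 100.
psites = [100, 103, 106, 109, 110, 112, 115, 118, 119, 121]
fractions = frame_fractions(psites, orf_start=100)
print(fractions)
```

Here 8 of the 10 P-sites are in frame with the start codon. An ORF with no periodic signal would show roughly one third of the P-sites in each frame.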

In [None]:
Workflow Demo

1.2.4 The rpbp pipeline step-by-step:
        - Creating reference genome indices;
        - Running the pipeline: creating ORF profiles, predicting translated ORFs.

In [None]:
# PART 2 QC INTRO



Pre-Processing and Quality Control

A characteristic feature of a high-quality Ribo-seq library is its distinct read-length distribution, which usually peaks at 29 nt in eukaryotic cytosolic ribosomes, reflecting the size of a translating ribosome on the RNA [10]. A broader distribution of reads has been observed in variants of the protocol, depending on the nuclease treatment [26]. An additional shorter footprint of 20 nt can be detected when performing Ribo-seq in absence of CHX or in presence of different inhibitors [29]. Distinct read-length distributions can also correspond to other ribosomal conformations [30] or to ribosomes belonging to different subcellular compartments. Mitochondrial ribosomes have been shown to display a bimodal distribution of read lengths, peaking at 27 and 33 nt, thus showing a clear difference when compared to cytosolic-derived RPFs [21] (Figure 1A).
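
As a minimal illustration of this quality-control step, the read-length distribution can be computed with a few lines of Python. The reads below are made up; in practice, the lengths would come from the trimmed FASTQ or the alignment files.

```python
from collections import Counter


def read_length_distribution(reads):
    """Return {length: fraction of reads} for an iterable of read sequences."""
    lengths = Counter(len(read) for read in reads)
    total = sum(lengths.values())
    return {length: n / total for length, n in sorted(lengths.items())}


# Made-up reads of 28, 29 and 30 nt.
reads = ["A" * 28, "A" * 29, "A" * 29, "A" * 29, "A" * 30]
dist = read_length_distribution(reads)
peak = max(dist, key=dist.get)
print(dist)
print("peak length:", peak)
```

A peak far away from the expected footprint length, or a very flat distribution, is a first hint that something went wrong during library preparation or trimming.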

Beyond Read-Counts: Ribo-seq Data Analysis to Understand the Functions of the Transcriptome. Lorenzo Calviello and Uwe Ohler



Depending on the efficiency of the rRNA removal step, a high percentage of reads consists of small structured RNAs (rRNAs, tRNAs, or snoRNAs), which should be removed because their overabundance can skew subsequent quantification. As we are sequencing a pool of RNA fragments, a splice-aware alignment such as STAR [32] or others [33] can be used. RPFs are short (29 nt), and many reads will map to multiple locations. To solve this, different 'rescue' strategies are used in popular RNA-seq quantification tools [34,35], but they are typically embedded inside a larger workflow for transcript quantification. Alternatively, the alignments can be filtered either by using specific tools [36] or by extracting one primary alignment per read [16]. In a high-quality Ribo-seq library, reads mostly map to coding sequence (CDS) regions (usually >85%) and 5'-UTRs (5–10%), and very few to 3'-UTRs (Figure 1B). Signals coming from introns and intergenic regions are usually the result of multi-mapping fragments.


The distribution of aligned Ribo-seq reads over the translated ORFs is dependent on the kinetics of the translation process: the assembly of the initiation complex is a relatively slow process, resulting in a pronounced accumulation of signal around the start codon. In most datasets an additional accumulation can be observed at the last codon of the ORF, caused by the slow kinetics of translation termination and peptide release. In aggregate profiles of RPF 5' ends over annotated start and stop codons, it is possible to appreciate the single-nucleotide resolution of Ribo-seq data (Figure 1C): in most datasets, especially the more recent ones [11,12,16,37], 5' ends accumulate on one of the possible three frames, thus revealing the translated frame. This level of resolution at the subcodon level is usually accompanied by a distinct offset of the 5' ends relative to the annotated start codons (12 nt for many datasets). This distance can be used to shift the positions of Ribo-seq reads and monitor translation at each translated codon, reflecting the positions of the P-site compartment for millions of ribosomal footprints. Aggregate profiles can drastically vary between different read lengths.
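
The shift described above amounts to adding a read-length-specific offset to the 5' end of each read. The sketch below assumes unspliced reads in genomic coordinates; the offsets are hypothetical, except that 12 nt is the value quoted above for many datasets.

```python
def psite_position(five_prime, read_length, offsets, strand="+"):
    """Shift a read's 5' end to its P-site using a per-read-length offset."""
    offset = offsets[read_length]
    # On the minus strand, the read extends towards lower genomic coordinates
    return five_prime + offset if strand == "+" else five_prime - offset


# Hypothetical offsets per read length.
offsets = {28: 12, 29: 12, 30: 13}
print(psite_position(1000, 29, offsets))
print(psite_position(2000, 30, offsets, strand="-"))
```

Keeping one offset per read length reflects the observation that aggregate profiles can differ between read lengths. Reads spanning a splice junction would need the offset to be applied along the transcript rather than along the genome.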


In [12]:
%%latex

\begin{align}
a && b && c \\
1 && 2 && 3
\end{align}

<IPython.core.display.Latex object>

In [2]:
#################################
#################################
#################################

In [5]:
# graphics

%load_ext autoreload
%autoreload 2
%matplotlib inline

# logging is needed below to silence matplotlib debug messages
import logging

import matplotlib
import matplotlib.ticker as mtick
import matplotlib.patches as patches
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

import seaborn as sns
sns.set({"ytick.direction": u'out'}, style='ticks')

params = {
   'axes.labelsize': 26,
   'font.size': 26,
   'legend.fontsize': 26,
   'xtick.labelsize': 24,
   'ytick.labelsize': 24,
   'text.usetex': True,
   'figure.figsize': [12, 8],
    'font.family': 'sans-serif',
    'font.sans-serif': 'DejaVu Sans',
    'mathtext.fontset': 'dejavusans'
   }
plt.rcParams.update(params)
font = FontProperties().copy()

mpl_logger = logging.getLogger('matplotlib')
mpl_logger.setLevel(logging.WARNING) 




MIT License

Copyright (c) 2019 Etienne Boileau