# Part 3.4: Introduction to transcriptome-wide analysis of translational efficiency

## Sections:
   - 3.4.1 Translational efficiency, translational control, and biological relevance.
   - 3.4.2 Bioinformatics perspective: software and models for TE analysis (high-level introduction).

## Questions & Objectives:
   - What is translational control, and how do RNA-seq and Ribo-seq together overcome the limitations of classical expression studies (RNA-seq alone)?
   - Where to start if I have matched RNA- and Ribo-seq data and a simple experimental design (treated vs. control)?

## After this part, I will be able to:
   - Better understand how to quantify translational control, and how to use and visualise the results.
   

## 3.4.1 Translational efficiency, translational control, and biological relevance

An important measure of translational control is translational efficiency (TE), defined as the rate of protein production per mRNA. We generally assume that, at steady state, TE reflects the efficiency of translation initiation, and that initiation is in fact the rate-limiting step of translation.

In practice (and in the literature), TE is calculated as the ratio of ribosome density, measured by Ribo-seq, to mRNA abundance, measured by RNA-seq. Since the Ribo-seq counts mapped to the coding region of a given gene depend on both the mRNA abundance and its rate of translation, differential translation between two conditions can be characterized by how much the change in Ribo-seq counts differs from the change in RNA-seq counts.
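
As a minimal sketch of the ratio described above, the toy example below computes TE for three hypothetical genes. The gene names and the numbers are made up for illustration, and the values are assumed to be already normalised for library size; this is not part of the course workflow.

```python
# Toy illustration of TE as a ratio (all names and numbers are made up).
# Values stand for library-size-normalised counts over the coding region.
toy_counts = {
    # gene: (ribo, rna)
    "geneA": (200.0, 100.0),
    "geneB": (2000.0, 1000.0),
    "geneC": (50.0, 100.0),
}


def translational_efficiency(ribo, rna):
    """Return Ribo/RNA, i.e. ribosome density per unit of mRNA abundance."""
    if rna <= 0:
        raise ValueError("RNA abundance must be positive to compute TE")
    return ribo / rna


te = {gene: translational_efficiency(ribo, rna)
      for gene, (ribo, rna) in toy_counts.items()}

for gene, value in te.items():
    print(f"{gene}: TE = {value:.2f}")
```

Here `geneA` and `geneB` have the same TE although `geneB` is ten times more abundant, which is why Ribo-seq counts alone cannot separate abundance from translation.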

Increased translation could be due to many different factors, including an increase in the abundance of the protein-coding transcript, an increase in translational efficiency due to an increased number of ribosomes binding to each transcript, a change in the rate at which ribosomes move along the transcript, or a combination of these factors. In general, we differentiate between *transcriptional regulation*, where the RNA-seq and Ribo-seq changes are concordant, and *translational regulation*, where the Ribo-seq change is not explained by the RNA-seq change (for example, when only Ribo-seq changes significantly).

It is now recognised that the correlation between transcript abundance and protein levels is generally poor, partly because gene expression is also regulated at the level of translation. This kind of analysis thus provides a more accurate and complete picture of gene expression than that traditionally given by RNA-seq alone. This is particularly important in, *e.g.*, experimental time series designs, where transcriptional and translational regulation can vary over time and thus reveal dynamic aspects of gene expression.


## 3.4.2 Bioinformatics perspective: software and models for TE analysis (high-level introduction)

A number of dedicated tools exist for TE analysis: Babel, Riborex, and RiboDiff (negative-binomial-based GLMs); Xtail (negative binomial model, built on DESeq2, which estimates a posterior for TE); and, more recently, Scikit-ribo (ribosome A-site prediction, with TE inference using a GLM with a ridge penalty).

Our workflow is based on the periodicity estimates made by `rpbp` (Part 3.1). For the RNA-seq data, we trim the reads, prior to mapping, to the maximum fragment length of the matching Ribo-seq sample (each Ribo-seq sample is matched with an RNA-seq sample). The count data are obtained by running `htseq-count`. We then calculate differences in TE as a ratio of ratios:

%%latex
\begin{align}
\frac{\left(\frac{Ribo}{RNA}\right)_{treated}}{\left(\frac{Ribo}{RNA}\right)_{control}} = \frac{\left(\frac{Ribo_{treated}}{Ribo_{control}}\right)}{\left(\frac{RNA_{treated}}{RNA_{control}}\right)}
\end{align}
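
The identity above can be checked numerically. The snippet below is a small sketch with made-up normalised counts for a single hypothetical gene; it also shows that, on the log2 scale, the ratio of ratios becomes a difference of log2 fold changes.

```python
import math

# Hypothetical normalised counts for one gene (made-up numbers).
ribo_treated, ribo_control = 400.0, 100.0
rna_treated, rna_control = 200.0, 100.0

# Left-hand side: TE in treated divided by TE in control.
te_treated = ribo_treated / rna_treated
te_control = ribo_control / rna_control
lhs = te_treated / te_control

# Right-hand side: Ribo-seq fold change divided by RNA-seq fold change.
fc_ribo = ribo_treated / ribo_control
fc_rna = rna_treated / rna_control
rhs = fc_ribo / fc_rna

assert math.isclose(lhs, rhs)

# On the log2 scale, the ratio of ratios is a difference of log2 fold changes.
log2_delta_te = math.log2(fc_ribo) - math.log2(fc_rna)
print(f"ratio of ratios = {lhs}, log2 change in TE = {log2_delta_te}")
```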


This is modeled in `DESeq2` using the design `~assay+condition+assay:condition`, where the interaction term `assay:condition` represents the ratio of ratios. We use the likelihood ratio test to test whether the translational efficiency is different in treatment *vs.* control.
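
To see why the interaction term captures the ratio of ratios, the sketch below builds the design matrix for a single hypothetical gene by hand and solves for the coefficients on log2 mean counts. This is an ordinary least-squares stand-in, not the negative binomial GLM that `DESeq2` actually fits, and it assumes that RNA-seq and control are the reference levels of `assay` and `condition`; all numbers are made up.

```python
import numpy as np

# Hypothetical mean normalised counts for one gene: (assay, condition, count).
samples = [
    ("rna", "control", 100.0),
    ("rna", "treated", 200.0),
    ("ribo", "control", 100.0),
    ("ribo", "treated", 400.0),
]

# Design matrix columns: intercept, assay (ribo=1), condition (treated=1),
# and the assay:condition interaction (ribo AND treated).
X = np.array([
    [1.0,
     float(assay == "ribo"),
     float(cond == "treated"),
     float(assay == "ribo" and cond == "treated")]
    for assay, cond, _ in samples
])
y = np.log2([count for _, _, count in samples])

# Solve the (saturated) linear model on the log2 scale.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, assay_effect, condition_effect, interaction = coef

print(f"condition (log2FC in the reference assay, RNA): {condition_effect:.2f}")
print(f"assay:condition (log2 ratio of ratios):        {interaction:.2f}")
```

With these numbers, the RNA-seq log2 fold change is 1 and the Ribo-seq log2 fold change is 2, so the interaction coefficient is their difference, 1.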


We have run the workflow on the course data, and we will explore the results below.


***

<div class="alert alert-block alert-danger"><b>The cells below contain "code", so we will need to run them one after the other!</b></div>


In [156]:
import pandas as pd
import numpy as np
import os
import yaml

from collections import defaultdict

from argparse import Namespace

import pbio.misc.logging_utils as logging_utils
import pbio.misc.mpl_utils as mpl_utils
import pbio.ribo.ribo_utils as ribo_utils

args = Namespace()
logger = logging_utils.get_ipython_logger()

In [158]:
# graphics

%load_ext autoreload
%autoreload 2
%matplotlib inline

import matplotlib
import matplotlib.ticker as mtick
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

import seaborn as sns
sns.set(style='ticks', rc={'ytick.direction': 'out'}) #color_codes=True, palette='muted')

params = {
   'axes.labelsize': 28,
   'font.size': 28,
   'legend.fontsize': 26,
   'xtick.labelsize': 26,
   'ytick.labelsize': 26,
   "lines.linewidth": 2.5,
   'text.usetex': True,
   'figure.figsize': [12, 8],
    'font.family': 'sans-serif',
    'font.sans-serif': 'DejaVu Sans',
    'mathtext.fontset': 'dejavusans'
   }
plt.rcParams.update(params)
font = FontProperties().copy()

args.fontsize = params['legend.fontsize']
args.legend_fontsize = params['legend.fontsize']
args.labelsize = params['axes.labelsize']

import logging
mpl_logger = logging.getLogger('matplotlib')
mpl_logger.setLevel(logging.WARNING) 

logger = logging.getLogger(__name__)
logger.setLevel(logging.WARNING) # this notebook's logger (not the root logger)



In [1]:
# course year
year = 2022

In [162]:
# the DGE (TE) results
args.dirloc = f'/pub/hbigs_course_{year}/part3_Riboseq/dgeRiboHBIGS{year}-analysis/tables'

# xlrd issues... (we don't want to mess with the install!)
# args.results = 'condition_EGF_vs_PBS.xlsx' # we don't use the shrunken FC
# df = pd.read_excel(os.path.join(args.dirloc, args.results), sheet_name='EGF_vs_PBS')

args.results = 'condition_EGF_vs_PBS.csv' # we don't use the shrunken FC
df = pd.read_csv(os.path.join(args.dirloc, args.results))

In [163]:
padj_thr = 0.1
lfc_thr = np.log2(1.5)

In [164]:
m_padj_rna = df['padj.rna'] < padj_thr
m_padj_ribo = df['padj.ribo'] < padj_thr
m_padj_inter = df['padj.inter'] < padj_thr

m_lfc_rna = abs(df['log2FC.rna']) >= lfc_thr
m_lfc_ribo = abs(df['log2FC.ribo']) >= lfc_thr
m_lfc_inter = abs(df['log2FC.inter']) >= lfc_thr

df.loc[m_padj_rna & m_lfc_rna & ~(m_lfc_ribo & m_padj_ribo), 'hue'] = 'RNA only'
df.loc[(m_padj_ribo & m_lfc_ribo) & ~(m_lfc_rna & m_padj_rna), 'hue'] = 'Ribo only'
df.loc[m_padj_rna & m_padj_ribo & m_lfc_rna & m_lfc_ribo, 'hue'] = 'Concordant (RNA+Ribo)'

df.loc[m_padj_inter & m_lfc_inter, 'hue'] = 'Interaction (translational regulation)'

# doesn't matter, it's only for the style!
df.loc[df['hue']=='Interaction (translational regulation)', 'style'] = 'cross'
df.loc[~(df['hue']=='Interaction (translational regulation)'), 'style'] = 'dot'

df.loc[df['hue'].isna(), 'hue'] = 'Unchanged' # not significant

ho = {'Unchanged': 'grey',
      'Concordant (RNA+Ribo)': 'black',
      'RNA only': 'blue',
      'Ribo only': 'red',
      'Interaction (translational regulation)': 'red'}
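
The labelling logic above can be tried on a small table. The sketch below uses a made-up data frame with the same column names as the results file (the gene symbols `g1` to `g5` and all values are hypothetical) and the same thresholds. It illustrates that the interaction label is assigned last, so it overrides the other labels.

```python
import numpy as np
import pandas as pd

padj_thr = 0.1
lfc_thr = np.log2(1.5)

# Made-up results for five hypothetical genes.
toy = pd.DataFrame({
    "symbol": ["g1", "g2", "g3", "g4", "g5"],
    "log2FC.rna": [1.2, 0.1, 1.0, 0.0, 1.1],
    "padj.rna": [0.01, 0.8, 0.01, 0.9, 0.01],
    "log2FC.ribo": [0.1, 1.5, 1.1, 0.1, 2.5],
    "padj.ribo": [0.7, 0.01, 0.02, 0.9, 0.001],
    "log2FC.inter": [-1.1, 1.4, 0.1, 0.1, 1.4],
    "padj.inter": [0.3, 0.02, 0.9, 0.9, 0.03],
})

# A gene is "significant" for a term if it passes both thresholds.
sig_rna = (toy["padj.rna"] < padj_thr) & (toy["log2FC.rna"].abs() >= lfc_thr)
sig_ribo = (toy["padj.ribo"] < padj_thr) & (toy["log2FC.ribo"].abs() >= lfc_thr)
sig_inter = (toy["padj.inter"] < padj_thr) & (toy["log2FC.inter"].abs() >= lfc_thr)

toy["hue"] = "Unchanged"
toy.loc[sig_rna & ~sig_ribo, "hue"] = "RNA only"
toy.loc[sig_ribo & ~sig_rna, "hue"] = "Ribo only"
toy.loc[sig_rna & sig_ribo, "hue"] = "Concordant (RNA+Ribo)"
# Assigned last, so it takes precedence over the three labels above.
toy.loc[sig_inter, "hue"] = "Interaction (translational regulation)"

print(toy[["symbol", "hue"]].to_string(index=False))
```

In this toy table, `g2` would otherwise be "Ribo only" and `g5` would otherwise be "Concordant (RNA+Ribo)"; both end up labelled as translationally regulated because their interaction term is significant.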


In [173]:
ax = plt.gca()

g = sns.scatterplot(x="log2FC.rna", 
                    y="log2FC.ribo", 
                    hue="hue", 
                    style='style', 
                    data=df, 
                    s=100, 
                    legend=False,
                    palette=ho)
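# move the left and bottom spines to zero so that the axes cross at the origin,
# and hide the top and right spines
ax.spines['left'].set_position('zero')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_position('zero')
ax.spines['top'].set_color('none')

# hide the third major tick label on each axis (index chosen for this data set,
# to declutter the region where the axes cross)
ax.yaxis.get_major_ticks()[2].label1.set_visible(False)
ax.xaxis.get_major_ticks()[2].label1.set_visible(False)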


In the paper, the authors show that changes in protein synthesis correlated with variations in TE, with minimal change in RNA abundance, for certain classes of genes. This cluster contained many ribosomal proteins and several translation initiation and elongation factors.

Let's see what we get for the significant terms.


In [172]:
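# genes with a significant interaction term (translationally regulated)
genes = df[m_padj_inter & m_lfc_inter].symbol.values
for gene in genes:  # `gene`, not `g`, to avoid shadowing the plot handle above
    print(gene)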

We can now briefly explore tools such as Gene Ontology (GO) enrichment analysis, pathway analysis, *etc.*
Please copy the list of translationally regulated genes that we found above, and we will go to [Enrichr](https://amp.pharm.mssm.edu/Enrichr/).
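
If copying from the cell output is inconvenient, a small helper can prepare the list as text with one symbol per line. This is only a sketch: `format_gene_list` is a hypothetical helper, not part of the course code, and the symbols below are placeholders standing in for the `genes` array above, not results.

```python
def format_gene_list(symbols):
    """Return unique, non-empty symbols, one per line, in first-seen order."""
    seen = set()
    cleaned = []
    for symbol in symbols:
        # Skip missing values (None, NaN) and empty strings.
        if not isinstance(symbol, str) or not symbol.strip():
            continue
        name = symbol.strip()
        if name not in seen:
            seen.add(name)
            cleaned.append(name)
    return "\n".join(cleaned)


# Placeholder symbols (hypothetical), including a duplicate and missing values.
example = ["GENE_A", "GENE_B", None, "GENE_A", float("nan"), " GENE_C "]
print(format_gene_list(example))
```

In the notebook, the same helper could be called as `format_gene_list(genes)` and the printed text pasted into the input box.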

In the next part of the lecture (Part 4), you will explore these aspects, and more, in more detail.



***

MIT License (code and scripts)

Copyright (c) 2022 Etienne Boileau