# STAT 450: Case Studies in Statistics

# Case study wrap-up: Relation between mRNA and protein levels

## Review

### Recall the data:
- tens of thousands of genes
- 12 tissues per gene
- each tissue and gene combo yields:  mRNA measurement, protein expression measurement
- some values of mRNA and protein levels are missing
- only 1,392 genes had no missing values for all 12 mRNA-protein pairs 

### Recall the client's claim

Using the median ratio of protein to mRNA levels per gene as a proxy for translation rates, our data show that [...] ***it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance***


In this notebook, we explore our client's model: 

1. how does it work?
2. how do we interpret it?
3. is it appropriate?

### Load libraries and read in the data file

Let's load our data set `data/tidy_data_valuesavailable.csv`, which we saved earlier after wrangling the two original data files and counting, for each gene, the number of tissues with a complete pair of mRNA and protein values.

In [None]:
# load libraries
library(tidyverse)
library(broom)
theme_set(theme_bw())

# load data
data_source <- 
    read_csv("data/tidy_data_valuesavailable.csv",show_col_types = FALSE) |>
    rename(values_available = values.available)

head(data_source)

### Subset to only genes with complete data

First we'll subset the data to keep only genes that have observations of both mrna and protein for all 12 tissues. 

In [None]:
genes_complete <- 
    data_source |>
    filter(values_available == 12)

head(genes_complete)

## Part 1.  How does the client's model work? 

**Client's claim:**

> Using the median ratio of protein to mRNA levels per gene as a proxy for translation rates, our data show that [...] ***it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance***

**Client's Questions:**

- Is our analysis statistically correct?
- Is there another way to analyze the data? If so, do we get similar results?

#### **Our part**

Think about:
1. **What did the client do?**
    - In this case, what is this median ratio approach? 
    - Can we reproduce their analysis?  

<br>

2. **Is there another (better) way to do the analysis?**
    - Do we get the same conclusions?

----------------------------------------------------------------------------

### What did the client do?

Let $g$ stand for the gene and $t$ stand for the tissue type. So ${\text{protein}}_{gt}$ is the protein level of gene $g$ in tissue $t$, and ${\text{mRNA}}_{gt}$ is the corresponding mRNA level.

The client's model equation: 
$$
{\rm{protein}}_{gt}  \approx  \beta_g \times ({\rm{ mRNA }}_{gt}) $$

The client's estimator: 
$$\hat{\beta}_g  = {\rm{  median ~of~ the ~12 ~ratios~of~}} 
\frac{{\rm{protein}}_{gt}}{ {\rm{ mRNA }}_{gt}} $$

The client predicts:

$$
\widehat{{\rm{protein}}}_{gt}  = \hat{\beta}_g \times ({\rm{ mRNA }}_{gt})$$
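
To make the three formulas above concrete, here is a minimal sketch for one hypothetical gene measured in only four tissues. The numbers are made up for illustration and are not taken from the client's data.

```r
# Hypothetical mRNA and protein levels for one gene in four tissues
mrna    <- c(2, 4, 5, 10)
protein <- c(10, 16, 30, 45)

# Step 1: one protein-to-mRNA ratio per tissue
ratio <- protein / mrna      # 5, 4, 6, 4.5

# Step 2: the client's slope estimate is the median of those ratios
beta_hat <- median(ratio)    # 4.75

# Step 3: predicted protein is the slope times the measured mRNA
pred <- beta_hat * mrna      # 9.5, 19, 23.75, 47.5

data.frame(mrna, protein, ratio, pred)
```

Note that the prediction for each tissue reuses the same tissue's protein value through the median, so these are in-sample fitted values rather than out-of-sample predictions.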

Let's try it for our 1,392 genes with no missing values.

In [None]:
genes_client <- 
    genes_complete |>
    group_by(gene) |>
    mutate(ratio = protein / mrna, 
           ratio_med = median(ratio, na.rm = TRUE),
           pred_model_client = mrna * ratio_med)

head(genes_client)

### Let's plot the data with the estimated lines

`geom_abline` adds a line/lines with given slopes, intercepts.
- The length of the intercept must be 1 or the number of rows in the tibble.  
- Same for slope.

<h5 style="color:red; font-weight:bold;">Exercise: make a plot for 4 random genes:  scatterplot plus client's line</h5> 

In [None]:
## your code goes here (fill in the ...)

set.seed(450) # don't change this line - for reproducibility of selecting 4 random genes

genes_client |> 
    ungroup() |>
    filter(gene %in% sample(unique(gene), 4)) |>
    ggplot(aes(x = ..., y = ...)) + 
    geom_...() + 
    facet_wrap(~..., scales = 'free') + 
    geom_abline(aes(intercept = 0, slope = ratio_med))

Do the lines go through the origin? 

<h5 style="color:red; font-weight:bold;">Exercise: Make the same plot but include the origin to check.</h5> 

Trick:  just add `xlim(0,NA)` and `ylim(0,NA)`

In [None]:
## your code goes here


Why didn't we start off considering the client's approach?

- EDA showed us that intercepts were NOT always equal to 0  

- least squares regression is preferred for its theoretical framework (we can perform hypothesis tests!)
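
As a rough illustration of both points, the sketch below simulates one hypothetical gene whose true intercept is far from 0 and compares the median-ratio slope with two least squares fits. The simulated values and object names (`slope_median`, `fit_origin`, `fit_ols`) are assumptions for this example only.

```r
set.seed(450)

# Hypothetical gene: 12 tissues, true intercept 100 and true slope 2
mrna    <- runif(12, min = 5, max = 20)
protein <- 100 + 2 * mrna + rnorm(12, sd = 5)

# Client-style estimate: median of the per-tissue ratios (forces a zero intercept)
slope_median <- median(protein / mrna)

# Least squares through the origin, and least squares with an intercept
fit_origin <- lm(protein ~ 0 + mrna)
fit_ols    <- lm(protein ~ mrna)

c(median_ratio = slope_median,
  ols_origin   = coef(fit_origin)[["mrna"]],
  ols          = coef(fit_ols)[["mrna"]])

# Only the lm() fits come with standard errors and tests for the coefficients
summary(fit_ols)$coefficients
```

In this simulation the two no-intercept slopes absorb the non-zero intercept and land well above the true slope of 2, while the fit with an intercept gives a slope estimate together with a standard error and p-value.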

## Part 2: How do we interpret the client's model?

Now that we see how these lines were estimated, let's think about what they mean. They are a sort of regression line with no intercept, **but** the slopes are calculated in a very peculiar way (NOT OLS). 

### Interpretation A of claim: gene by gene analysis 

>"Using the median ratio of protein to mRNA levels per gene as a proxy for translation rates"

For a specific gene, can we predict the protein level from mRNA across all 12 tissues?  

We've explored gene-by-gene linear least squares regression analysis in previous lectures, and found:

- for many genes, the correlation between mRNA and protein was small (some negative, some positive)
- there was very little evidence for a linear relationship between mRNA and protein, even after log-transformation
- predictive ability was poor when assessed with cross-validation
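
For reference, leave-one-out cross-validation for a single gene can be sketched as follows. This is a minimal illustration on simulated data for one hypothetical gene, not the exact procedure from the earlier lectures; the helper `loocv_rmse()` is made up for this example and assumes the response column is called `protein`.

```r
set.seed(450)

# Hypothetical gene: 12 tissues, protein simulated independently of mRNA
toy <- data.frame(mrna = runif(12, min = 1, max = 10))
toy$protein <- 50 + rnorm(12, sd = 10)

# Leave-one-out CV: refit without tissue i, then predict the held-out tissue
loocv_rmse <- function(formula, data) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$protein[i] - predict(fit, newdata = data[i, , drop = FALSE])
  })
  sqrt(mean(errs^2))
}

rmse_lm   <- loocv_rmse(protein ~ mrna, toy)  # regression on mRNA
rmse_mean <- loocv_rmse(protein ~ 1, toy)     # baseline: mean of the other tissues

c(lm = rmse_lm, mean_only = rmse_mean)
```

Comparing the two errors shows whether mRNA adds anything beyond predicting with the gene's average protein level.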

As a review, here's one of the models we considered, which includes an intercept, and allows for gene-specific variances:

In [None]:
lm_gene <- genes_complete %>%
  group_by(gene) %>%
  group_modify(~ tidy(lm(protein ~ mrna, data = .x))) 
lm_gene %>% head()

How many genes showed a significant linear relationship at 0.05 level after adjustment for multiple comparisons (using FDR = "False Discovery Rate")?

In [None]:
p_gene <- lm_gene %>% 
  filter(grepl("mrna", term)) %>% 
  pull(p.value) 

sum(p.adjust(p_gene, method = "BH") < ...)

Aside: FDR-controlling procedures such as Benjamini-Hochberg (`method = "BH"`) control the expected proportion of rejected hypotheses that are false positives. This is less conservative than something like Bonferroni, which controls the family-wise error rate (FWER): the probability of making at least one false positive error across all tests conducted.
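
A small sketch of the difference, using five made-up p-values (not from our data) and base R's `p.adjust()`:

```r
# Hypothetical raw p-values from five tests
p_raw <- c(0.001, 0.008, 0.012, 0.03, 0.20)

p_bh   <- p.adjust(p_raw, method = "BH")          # controls the FDR
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls the FWER

data.frame(p_raw, p_bh, p_bonf)

# Number of rejections at the 0.05 level under each adjustment
n_reject_bh   <- sum(p_bh < 0.05)    # 4
n_reject_bonf <- sum(p_bonf < 0.05)  # 2
```

On the same p-values, BH rejects more hypotheses than Bonferroni, which is what "less conservative" means in practice.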

An informative diagnostic for examining the validity of p-values across many tests (in this case, over a thousand) is a histogram of the p-values. Under the null, the p-values should be uniformly distributed between 0 and 1. An enrichment of significant tests would show up as a spike near zero. Any other pattern can indicate that things have gone awry (e.g. some assumptions of the test are violated). [This blog post](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/) on this issue is a great resource. 

In [None]:
hist(p_gene)
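
For comparison, here is a sketch of what the histogram looks like when every null hypothesis is true. We simulate 2,000 hypothetical "genes" with 12 tissues each, where the two variables are independent normal draws, and collect the slope p-value from each regression. All names and settings here are assumptions for this illustration.

```r
set.seed(450)

n_tests   <- 2000  # number of simulated genes
n_tissues <- 12    # observations per gene

# Slope p-value from lm() when x and y are truly unrelated
p_null <- replicate(n_tests, {
  x <- rnorm(n_tissues)
  y <- rnorm(n_tissues)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Counts in ten equal-width bins; under the null they should be roughly equal
bin_counts <- hist(p_null, breaks = seq(0, 1, by = 0.1), plot = FALSE)$counts
bin_counts
```

Dropping `plot = FALSE` draws the histogram itself, which should look approximately flat. A spike near zero in `hist(p_gene)` relative to this flat reference would suggest genes with a real linear relationship.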

### Interpretation B of claim: tissue by tissue analysis 

>"it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance"

For a specific tissue (e.g. kidney), can we predict protein level from mRNA for thousands of genes?

In [None]:
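genes_complete |>
  filter(tissue == 'kidney') |>
  ggplot(aes(mrna, protein)) +
  # a single point layer (the original drew the points twice)
  geom_point(size = 3, alpha = 0.5) + 
  scale_x_log10() + scale_y_log10() +
  theme_bw() +
  # name the `line` argument explicitly rather than relying on positional matching
  theme(text = element_text(size = 24), line = element_line(linewidth = 1)) 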

mRNA is a moderately strong predictor of protein level within tissue. Note that, looking across genes, the data are considerably right-skewed, so we apply a log transformation to both variables. 
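
To see why the log scale helps with right-skewed data, here is a sketch on simulated values (not our data) where the relationship is linear on the log10 scale by construction. The coefficients 0.8 and 0.5 are arbitrary choices for this illustration.

```r
set.seed(450)
n <- 1000

# Hypothetical genes: the relationship is linear on the log10 scale
log_mrna    <- rnorm(n, mean = 0, sd = 1)
log_protein <- 0.8 * log_mrna + rnorm(n, sd = 0.5)
mrna    <- 10^log_mrna
protein <- 10^log_protein

# Right skew on the raw scale: the mean sits far above the median
c(mean = mean(mrna), median = median(mrna))

# Pearson correlation on the raw scale vs. the log10 scale
cor_raw <- cor(mrna, protein)
cor_log <- cor(log10(mrna), log10(protein))
c(raw = cor_raw, log10 = cor_log)
```

On the raw scale a handful of very large values dominate the Pearson correlation, so it can be a poor summary of the bulk of the genes; on the log10 scale every gene contributes more evenly.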

In [None]:
lm_tissue <- genes_complete %>%
  group_by(tissue) %>%
  group_modify(~ tidy(lm(log10(protein) ~ log10(mrna), data = .x))) 
lm_tissue

All 12 tissues show a significant linear relationship between log-transformed mRNA and log-transformed protein levels across genes, even after adjustment for multiple comparisons.

In [None]:
p_tissue <- lm_tissue %>% 
  filter(grepl("mrna", term)) %>% 
  pull(p.value) 

# note: with no `method` argument, p.adjust() uses Holm's method
sum(p.adjust(p_tissue) < 0.05)

p_tissue

## Part 3: Is the client's model appropriate?

### Which is appropriate?  A or B?  

- This is a subject area question
- The statistician needs to work with the researcher so that both understand how the model fits the question
- For this particular problem, it turns out A is much more biologically useful (but challenging!!)


### What did the client do?

1. Carried out a **gene-by-gene** analysis (with no intercept and with a funny slope estimate) and obtained fitted values from the **gene-by-gene** models
2. Assessed fits by acting as if this were **tissue-by-tissue** analysis, reporting correlation between protein values and fitted values for tissue


### Client mixed interpretations A and B!

1. get predicted protein values from gene-by-gene regression
2. calculate correlation between actual protein values and the gene-by-gene fitted protein values *across all genes in a tissue*

Make up your mind! Gene-by-gene? Tissue-by-tissue?

### Gene-by-gene fits across all genes in one tissue

Let's replicate what they did using our linear modeling approach. First, we'll fetch fitted protein level values from the gene by gene regressions we fit above.

In [None]:
genes_kidney <-
  genes_complete |>
  filter(tissue == 'kidney')

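# Look up each gene's fitted intercept and slope from lm_gene, then
# evaluate that gene's regression line at its kidney mRNA value
genes_kidney$gene_fit <- NA_real_
for(g in genes_kidney$gene){
    coefs <- lm_gene |> filter(gene == g) |> pull(estimate)  # (intercept, slope)
    mrna <- genes_kidney |> filter(gene == g) |> pull(mrna)
    genes_kidney$gene_fit[genes_kidney$gene == g] <- coefs[1] + mrna*coefs[2]
}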
head(genes_kidney)

Next, we'll calculate correlations across all genes within a single tissue - **we get an extremely high correlation!!!**

In [None]:
# Pearson
cor(genes_kidney$protein, genes_kidney$gene_fit) |> round(3)

# Spearman
cor(genes_kidney$protein, genes_kidney$gene_fit, method = "spearman") |> round(3)

Why is the Pearson correlation higher than the Spearman correlation? Let's visualize these results:

In [None]:
ggplot(genes_kidney, aes(protein, gene_fit)) +
  geom_point(size = 3, alpha = 0.5) + 
  theme(text = element_text(size = 24), line = element_line(linewidth = 1)) +
  xlab("protein") + ylab("fitted protein") + 
  ggtitle("Kidney Gene-by-Gene Fits")

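Let's view this relationship on the log scale. We add a small pseudocount to both axes because some genes have a negative predicted protein level; any point whose shifted value is still not positive cannot be shown on a log axis and is dropped from the plot.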

In [None]:
ggplot(genes_kidney, aes(protein+1e-3, gene_fit+1e-3)) +
  geom_point(size = 3, alpha = 0.5) + 
  theme(text = element_text(size = 24), line = element_line(linewidth = 1)) +
  xlab("protein") + ylab("fitted protein") + 
  ggtitle("Kidney Gene-by-Gene Fits (log)") +
  scale_x_log10() +
  scale_y_log10()

## What is happening? 

From [Fortelny, Overall, Pavlidis and Cohen Freue (Nature, 2017)](https://www.nature.com/articles/nature23293):

> "...we show that it is in fact possible to achieve a high correlation across genes without using any mRNA levels ..."

> "The high correlations ... are driven by the large degree of variation in protein levels between genes....  This generates a high correlation between predicted and observed protein levels across genes even when these correlations are low for individual genes."

**This is an example of Simpson's paradox**: the trend within groups (here, genes) differs from the trend across groups.

 ![](img/simpsons.gif)
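
The effect can be reproduced in a small simulation. In the sketch below, mRNA is pure noise that carries no information about protein, yet the client-style median-ratio predictions correlate strongly with observed protein across genes within one tissue. All settings (500 genes, the spread of gene-level protein abundance, the noise level) are assumptions chosen for illustration.

```r
set.seed(450)
n_genes   <- 500
n_tissues <- 12

# Each hypothetical gene has its own typical protein level, spanning orders of magnitude
gene_level <- 10^rnorm(n_genes, mean = 2, sd = 1)

# Rows are genes, columns are tissues; mRNA is simulated independently of protein
noise   <- function() matrix(exp(rnorm(n_genes * n_tissues, sd = 0.2)), n_genes, n_tissues)
protein <- gene_level * noise()
mrna    <- noise()

# Client-style prediction: per-gene median ratio times mRNA
ratio_med <- apply(protein / mrna, 1, median)
pred      <- ratio_med * mrna

# Within each gene (across tissues): mRNA tells us nothing about protein
within_cor <- sapply(seq_len(n_genes), function(g) cor(mrna[g, ], protein[g, ]))
median(within_cor)

# Across genes within a single tissue: predictions look excellent
across_spearman <- cor(protein[, 1], pred[, 1], method = "spearman")
across_spearman
```

The across-gene correlation is high only because `ratio_med` encodes each gene's overall protein level; the mRNA measurements contribute nothing in this simulation.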

## Main takeaways

- Choosing an appropriate statistical method to answer a scientific question is an iterative and challenging process
- We may come across data that has been analyzed in inadvisable ways
- Translating statistical output "honestly" into lay person's terms is not easy
- Statisticians have the duty to deal with all of the above

----