# STAT 450: Case Studies in Statistics

## Example Case study: Relation between mRNA and protein levels 

### Lecture pre-reading/watching

1. Background video on protein synthesis (9 minutes): https://www.youtube.com/watch?v=oefAI2x2CQM 

<img src="img/videoscreenshot.png" width=40%>

2. Background reading on exploratory data analysis: [Chapter 4 of the Art of Data Science](https://bookdown.org/rdpeng/artofdatascience/exploratory-data-analysis.html)


###  Outline for today:
1. Biology basics 
2. Client's data
3. Scientific question and client's claim
4. Initial checking of the data

There are some exercises built in to check our understanding (not to be handed in).

## Section 1. Biology basics: DNA to RNA to Protein

Nearly every cell in our body contains our genetic material: our genes, encoded in our DNA. Some genes act as instructions to make molecules called proteins.

**Transcription**:  Within the cell nucleus, a "protein recipe" is transferred from a gene to mRNA (messenger RNA).  This process is called transcription.  

**Translation**:  The mRNA takes the recipe outside of the nucleus, where ribosomes read the recipe and link amino acids together to form a protein.  This process is called translation.


This picture illustrates what is known as the Central Dogma of Molecular Biology:

![](https://upload.wikimedia.org/wikipedia/commons/6/68/Central_Dogma_of_Molecular_Biochemistry_with_Enzymes.jpg)

Because proteins are synthesized by translation from RNA, one might expect that high RNA levels should lead to high protein levels.  
<img src="img/translation.png" width=60%>

Despite this expectation of a high correlation between RNA and protein levels, experimental results have shown very low correlation values.

<img src="img/articlelowcorr.png" width=70%>

In 2014, a research group claimed to have found a "predictive model" that can be used to predict protein levels from RNA levels!

<img src="img/articleclient.png" width=70%>

Quote from the article by [Wilhelm et al. (Nature 2014)](https://www.nature.com/articles/nature13319)

>"**...it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance.**"

We'll treat the data that this group submitted to the journal as if it were "our client's data".


## Section 2. Client's data

### Measuring RNA and protein 

There are many different ways to quantify mRNA and protein levels. Here is some background information on the measures used in this case study.

#### RNA measurements
- RNA-seq (RNA-sequencing) is a technology that provides counts of short transcript fragments, known as **reads**
- Reads are then 'mapped' back to the genome to summarize as gene-level counts
- Complications: there are duplicate fragments, and we cannot count every fragment present, so the counts are a random sample of what is in the cell
  - counts depend on total number of reads for each sample (depth)
  - FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is a way to 'normalize' counts so they are comparable across samples with different depth
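
To make the depth normalization concrete, here is a minimal sketch of the FPKM calculation, following the acronym directly: fragments, per kilobase of transcript, per million mapped fragments. All counts, gene lengths, and sample names below are made up for illustration, and the helper `fpkm()` is our own, not part of any package.

```r
# Hypothetical fragment counts for two genes in two samples.
# The "deep" sample was sequenced at twice the depth of the "shallow" one.
gene_length_bp <- c(geneA = 2000, geneB = 4000)
counts_shallow <- c(geneA = 500,  geneB = 1000)
counts_deep    <- c(geneA = 1000, geneB = 2000)
total_shallow  <- 20e6  # total mapped fragments in the shallow sample
total_deep     <- 40e6  # total mapped fragments in the deep sample

# FPKM = fragments / (transcript length in kb) / (total mapped fragments in millions)
fpkm <- function(counts, length_bp, total_fragments) {
  counts / (length_bp / 1000) / (total_fragments / 1e6)
}

fpkm_shallow <- fpkm(counts_shallow, gene_length_bp, total_shallow)
fpkm_deep    <- fpkm(counts_deep,    gene_length_bp, total_deep)

fpkm_shallow
fpkm_deep
```

Although the raw counts differ by a factor of two between the toy samples, the FPKM values agree, which is the sense in which FPKM makes samples of different depth comparable.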

#### Protein measurements
- Mass spectrometry is a technology that measures the mass-to-charge ratio of particles in a sample
  - different proteins have different mass-to-charge ratios
- Intensity at each ratio gives information about how abundant each protein is
  - iBAQ (intensity Based Absolute Quantitation) is one way to convert intensities to an estimate of absolute abundance
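
As a rough sketch of the idea behind iBAQ: as we understand it, the summed peptide intensities for a protein are divided by the number of peptides that could theoretically be observed for that protein, so that large proteins (which yield many peptides) are not automatically scored as more abundant. The intensities and peptide counts below are invented for illustration; real pipelines involve many more steps.

```r
# Hypothetical measured peptide intensities for two proteins
peptide_intensity <- list(
  proteinA = c(2e6, 3e6, 1e6),
  proteinB = c(4e6, 2e6)
)

# Hypothetical number of theoretically observable peptides per protein
n_observable <- c(proteinA = 12, proteinB = 3)

# iBAQ-style value: total intensity divided by number of observable peptides
total_intensity <- sapply(peptide_intensity, sum)
ibaq <- total_intensity / n_observable[names(total_intensity)]

ibaq
```

In this toy example both proteins have the same total intensity, but the smaller protein B ends up with the larger abundance estimate.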

**Question**: Could either of these quantities be influenced by *random variation* or *uncertainty*?

### The data files
- Two data files are provided alongside this notebook:  
  - `proteinUN.csv`  (protein data)
  - `geneUN.csv`  (mRNA data)
- Each file contains information on 6104 genes (the same genes in each file)
- For each gene, we have measurements on 12 tissue types
- Thus, for each gene, we have 12 pairs of measurements:  protein level and mRNA level

## Section 3. Scientific question and client's claim

### Scientific question: Can we predict protein level from RNA with good accuracy?

### Client's claim

>*"Using the median ratio of protein to mRNA levels per gene as a proxy for translation rates, our data show that, it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance"*

![](img/nature_res.png)

Both plots above come from [Wilhelm et al. (2014)](https://www.nature.com/articles/nature13319): Left is taken from Supplementary Figure 7a, and right is taken from Figure 5a.

### Client's questions for us

1. Is our analysis statistically correct?

2. Is there another way to analyze the data? If so, do we get similar results?

### Our approach

What did the client do? What is this 'median ratio' approach? 

Can we reproduce their analysis? Do we get the same conclusions?
 
Is there another (better) way to do the analysis?
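
To fix ideas, here is one plausible reading of the 'median ratio' approach, based only on the quote above: for each gene, take the median of the protein-to-mRNA ratios across tissues, then predict protein in a tissue as that median ratio times the tissue's mRNA level. This is our interpretation on toy data, not a confirmed reproduction of the client's procedure; the tibble `toy` and its values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical measurements: 2 genes x 3 tissues
toy <- tribble(
  ~gene, ~tissue,  ~mrna, ~protein,
  "A",   "liver",   1,     10,
  "A",   "lung",    2,     22,
  "A",   "kidney",  4,     36,
  "B",   "liver",   5,      5,
  "B",   "lung",   10,     20,
  "B",   "kidney",  2,      6
)

toy_pred <- toy %>%
  group_by(gene) %>%
  mutate(
    ratio        = protein / mrna,              # per-tissue protein-to-mRNA ratio
    median_ratio = median(ratio, na.rm = TRUE), # one number per gene
    predicted    = median_ratio * mrna          # predicted protein level
  ) %>%
  ungroup()

toy_pred
```

Note that in this sketch the median ratio is computed from the same protein values it is then used to predict, which is worth keeping in mind when we judge how "accurate" the predictions look.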

## Section 4. Initial checking of the data

### Exploratory Data Analysis Checklist

From Chapter 4 of [The Art of Data Science (by Peng and Matsui)](https://bookdown.org/rdpeng/artofdatascience/)

>1. Formulate your question
>2. Read in your data
>3. Check the packaging
>4. Look at the top and the bottom of your data
>5. Check your “n”s
>6. Validate with at least one external data source
>7. Make a plot
>8. Try the easy solution first
>9. Follow up

We'll go through the first several steps, as well as perform some additional dataset-specific explorations.


### 1. Formulate your question

Can you predict protein levels from mRNA levels?  
More loosely:  are protein levels and mRNA levels related?  (We may need to fine tune this.)

  
### 2. Read in your data

Load libraries and read the data files `proteinUN.csv` and `geneUN.csv`.


First take a look at the data files directly (in directory `lectures/data`).

Next, load any necessary libraries. Here for illustration we'll load a library called `tidyverse` that actually includes several handy libraries within it. You'll learn more about them as we go.

In [None]:
library(tidyverse)

Next read in the two CSV data files with the `read_csv` function (since we have `.csv` files)

In [None]:
mrna <-
  read_csv("data/geneUN.csv") 

prot <-
  read_csv("data/proteinUN.csv") 

### 3. Check the packaging

Let's check the dimensions of each object. Another useful function that can tell us about the packaging is `str`; try it!

In [None]:
dim(mrna)
dim(prot)

Looks like both files have 6104 rows and 13 columns. That is one more column than the 12 tissues we expected; we'll see why shortly.

Do the column names of `prot`  agree with the column names of `mrna`? Notice anything else?

In [None]:
colnames(mrna)
colnames(prot)
colnames(prot) == colnames(mrna)

Let's check rownames:  Do I want to use the same functions to check 6104 rownames? Try the `all` function instead.

In [None]:
# can we do better than this? 
rownames(prot) == rownames(mrna)
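
As a small illustration of the hint above, here is how `all` (and the stricter `identical`) collapse an element-wise comparison into a single answer. The name vectors are toy examples, not the real column or row names.

```r
# Hypothetical name vectors
names_a <- c("gene", "liver", "lung", "kidney")
names_b <- c("gene", "liver", "lung", "kidney")
names_c <- c("gene", "lung", "liver", "kidney")  # same names, different order

names_a == names_c           # element-wise: one TRUE/FALSE per position
all(names_a == names_b)      # TRUE only if every position matches
all(names_a == names_c)      # FALSE: the order differs
identical(names_a, names_c)  # also FALSE; additionally checks length and type
```

One caveat, as far as we know: tibbles returned by `read_csv` carry only default row names ("1", "2", ...), so comparing `rownames` of two tibbles of the same size will always agree. Comparing the first column of each object, for example `all(prot[[1]] == mrna[[1]])`, is likely the more informative check here.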

### 4. Check top and bottom of files

Here we want to check for any inconsistencies or surprises in reading in the data (e.g. characters appearing mixed with numeric values, incomplete rows at the end, missingness, unclear sample or feature names, etc...)

In [None]:
head(mrna)
head(prot)

In [None]:
tail(mrna)
tail(prot)

Luckily for us, we checked the file. We notice a couple of things. First, it looks like we have some missing values (we'll come back to this later). In addition, the first column has a weird name. But looking at the contents, we recognize those as the gene names! Not to worry, we can use the `rename` function to rename a column.

In [None]:
prot <-
  prot %>%
  rename(gene = ...1)
mrna <-
  mrna %>%
  rename(gene = ...1)

head(prot)
head(mrna)

That's better!
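
Since we noticed missing values above, a quick way to see how much is missing is to count `NA`s per column and per row. The sketch below uses a small made-up tibble, `toy_prot`, standing in for the real data.

```r
library(tibble)

# Hypothetical protein table with some missing measurements
toy_prot <- tibble(
  gene  = c("g1", "g2", "g3"),
  liver = c(1.2, NA, 3.4),
  lung  = c(NA, NA, 0.5)
)

na_per_column <- colSums(is.na(toy_prot))  # missing values in each column
na_per_gene   <- rowSums(is.na(toy_prot))  # missing values in each row (gene)
na_total      <- sum(is.na(toy_prot))      # missing values overall

na_per_column
na_per_gene
na_total
```

The same calls should work on `prot` and `mrna` directly.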

### 5. Check your "n"s

Let's make sure that we actually have 6104 *unique* genes and 13 *unique* columns in `mrna`.

In [None]:
# the gene names live in the `gene` column (a tibble has only default row names)
length(unique(mrna$gene))
length(unique(colnames(mrna)))

Let's do the same thing for `prot`, but change up our coding style to the *tidyverse* way using the *pipe* operator `%>%`:

In [None]:
# rewrite the previous using the pipe

This style tends to be more readable when performing a series of actions on the same object.
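
To show the difference in style without giving away the cell above, here is the same kind of count written both ways on a toy tibble. The tibble `toy` is hypothetical; `pull` and `n_distinct` are `dplyr` functions.

```r
library(dplyr)
library(tibble)

# Hypothetical table in which one gene appears twice
toy <- tibble(
  gene  = c("g1", "g2", "g2", "g3"),
  liver = c(1, 2, 2, 3)
)

# Nested style: read from the inside out
n_genes_nested <- length(unique(toy$gene))

# Piped style: read from left to right, one step at a time
n_genes_piped <- toy %>%
  pull(gene) %>%
  unique() %>%
  length()

# A dplyr shortcut for the same count, returned as a one-row tibble
toy %>% summarize(n_genes = n_distinct(gene))
```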

### Try it yourself! Practice these checks so far using mini data files

#### Exercise 1 ##

Read the two files `data/prot.mini.csv`  and `data/mrna.mini.csv`

In [None]:
mrna.mini <-
  read_csv("data/mrna.mini.csv") %>%
  rename(gene = ...1)

# do the same thing for prot.mini
prot.mini <- # your code here

#### Exercise 2 ##
Check the beginning of each data matrix (`prot.mini`, `mrna.mini`).

In [None]:
# your code here

#### Exercise 3 ##
What are the dimensions of each data matrix? 

In [None]:
# your code here

#### Exercise 4 ##
How many genes does each data matrix contain?

In [None]:
# your code here

#### Exercise 5 ##
Do the two data matrices have the same column and row names, in the same order?

In [None]:
# your code here

### 6. Validate with at least one external data source

Example - we could examine the literature to double check that the reported ranges of mRNA and protein levels measured by these technologies align with our expectation. In addition, we could check that certain tissues expected to have high levels of a particular mRNA or protein indeed have higher levels than other tissues in the data.

### 7. Make a plot!

After exploring the data a little bit, many questions start popping up! The client is your ally in conducting an insightful analysis, so discuss any questions you have with them. For example:

Do you want to look at correlations *per gene* (across tissues, n <= 12) or *across genes* (within a tissue, n <= 6104)? Does it even make sense to do it otherwise?
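
The two kinds of correlation can give very different pictures. Below is a minimal sketch on made-up data with columns `gene`, `tissue`, `mrna`, and `protein` (hypothetical names and values): grouping by gene gives one correlation per gene across tissues, while grouping by tissue gives one correlation per tissue across genes.

```r
library(dplyr)
library(tibble)

# Hypothetical measurements: 3 genes x 3 tissues
toy <- tribble(
  ~gene, ~tissue,  ~mrna, ~protein,
  "A",   "liver",   1,      2,
  "A",   "lung",    2,      4,
  "A",   "kidney",  3,      6,
  "B",   "liver",  10,     30,
  "B",   "lung",   20,     20,
  "B",   "kidney", 30,     10,
  "C",   "liver",   5,     12,
  "C",   "lung",    6,     11,
  "C",   "kidney",  7,     15
)

# Per gene: correlation across tissues (at most 12 points per gene in the real data)
per_gene <- toy %>%
  group_by(gene) %>%
  summarize(
    r       = cor(mrna, protein, use = "complete.obs"),
    n_pairs = sum(!is.na(mrna) & !is.na(protein))  # pairs actually used
  )

# Across genes: correlation within each tissue (at most 6104 points per tissue)
per_tissue <- toy %>%
  group_by(tissue) %>%
  summarize(
    r       = cor(mrna, protein, use = "complete.obs"),
    n_pairs = sum(!is.na(mrna) & !is.na(protein))
  )

per_gene
per_tissue
```

In the toy data, gene A is perfectly positively correlated across tissues and gene B perfectly negatively, yet neither fact tells us directly how well mRNA tracks protein across genes within a tissue.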

Let's do some exploration on how things look within a single gene. We'll look at the gene named ENSG00000000419 and make a scatterplot of protein vs. mRNA.


#### But wait...
Currently, our values are in two separate data frames! Of course, one could work with the two data sets separately, but the chances of making a mistake are enormous. It would be much better to have the protein and mRNA measurements for each tissue and each gene in the same data frame.

What's more, we have observations that are spread over the columns, and variables that are spread over the rows. Ideally, we want our data to be tidy! In simple terms, this means: <u><em>"each row is an observation and each column is a variable"</em></u>. 

Let's think about our case here. We want to check the level of protein (or mrna) for each pair (gene, tissue). So, we actually are not talking about one row per gene, but instead, one row per pair (gene, tissue). This means that our columns are different observations, and should be in rows. 

Not to worry! The `pivot_longer` function helps us do that.

Resource on `pivot_longer` function: https://datasciencebook.ca/wrangling.html#tidying-up-going-from-wide-to-long-using-pivot_longer

In [None]:
tidy_prot <-
    prot %>% 
    pivot_longer(
        !gene, # the columns to gather into rows (!gene means all columns except gene)
        names_to = "tissue", # name of the new column that will hold the old column names (tissues)
        values_to = "protein" # name of the new column that will hold the values
    )

tidy_prot %>% head()

#### Exercise 6 

Now pivot the mrna data.frame so it also has one row per gene and tissue combination.

In [None]:
## Put your code here!

Now, we're finally ready to **join** the RNA and protein tidy tibbles!

![](img/inner_join_R4DS.png)

_credits: image drawn from "R for Data Science - Wickham H., Grolemund G. - available at: https://r4ds.had.co.nz/index.html"_

In [None]:
tidy_data <- tidy_mrna %>% 
    inner_join(tidy_prot, by = c("gene", "tissue")) # state the join keys explicitly

tidy_data %>% head()
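
To see what `inner_join` keeps and drops, here is a small sketch on made-up tibbles (`toy_mrna` and `toy_prot` are hypothetical). Only the (gene, tissue) pairs present in both tables survive the join; `anti_join` lists the rows of the first table that found no match.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy tables that only partly overlap
toy_mrna <- tribble(
  ~gene, ~tissue, ~mrna,
  "g1",  "liver", 1.0,
  "g1",  "lung",  2.0,
  "g2",  "liver", 3.0
)

toy_prot <- tribble(
  ~gene, ~tissue, ~protein,
  "g1",  "liver", 10,
  "g2",  "liver", 30,
  "g3",  "liver", 50
)

# Keep only the (gene, tissue) pairs found in both tables
toy_joined <- inner_join(toy_mrna, toy_prot, by = c("gene", "tissue"))

# Rows of toy_mrna with no partner in toy_prot
toy_unmatched <- anti_join(toy_mrna, toy_prot, by = c("gene", "tissue"))

toy_joined
toy_unmatched
```

Comparing the number of rows before and after a join is a cheap way to confirm that nothing was silently dropped.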

Let's look at the number of observations per tissue.

In [None]:
tidy_data %>% group_by(tissue) %>% 
    summarize(n = n())
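
A small aside on counting, sketched on a made-up tibble (`toy` is hypothetical): `count` is a `dplyr` shorthand for the `group_by` plus `summarize(n = n())` pattern, and `n()` counts rows whether or not the measurement is missing. To count actual measurements, sum the non-missing values instead.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data with one missing protein value
toy <- tribble(
  ~gene, ~tissue, ~protein,
  "g1",  "liver", 1,
  "g2",  "liver", NA,
  "g1",  "lung",  2
)

# Shorthand for group_by(tissue) %>% summarize(n = n())
rows_per_tissue <- toy %>% count(tissue)

# Rows versus non-missing measurements per tissue
per_tissue <- toy %>%
  group_by(tissue) %>%
  summarize(
    n          = n(),                  # number of rows
    n_measured = sum(!is.na(protein))  # number of non-missing values
  )

rows_per_tissue
per_tissue
```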

#### Exercise 7
Find the number of observations per gene. How many observations should you expect to see per gene?

In [None]:
## Put your code here

Finally, let's pull out the gene named ENSG00000000419 and make a scatterplot of protein vs. mrna!!

In [None]:
tidy_data  %>% 
    filter(gene == "ENSG00000000419")  %>% 
    ggplot(aes(x = mrna, y = protein)) + 
    geom_point() + 
    ggtitle("ENSG00000000419")

Note that each point corresponds to a tissue. Let's add the tissue as a colour.

In [None]:
tidy_data %>% filter(gene == "ENSG00000000419") %>%
    ggplot(aes(x = mrna, y = protein, color = tissue)) + 
    geom_point() + ggtitle("ENSG00000000419") 

#### Exercise 8

Find the correlation between protein and mrna values for this gene.

In [None]:
## find the correlation

Interesting. Note that we've only looked at one gene so far. And we haven't explored the implications of the missing data. We'll come back to this in future lectures.

### References
- Original article: [Wilhelm et al. *Mass-spectrometry-based draft of the human proteome.* Nature 2014.](https://www.nature.com/articles/nature13319)

- Article finding flaws with the above: [Fortelny et al. *Can we predict protein from mRNA levels?*  Nature 2017.](https://www.nature.com/articles/nature23293)

- Some more background on different ways of looking at correlations in protein levels with mRNA abundance: [Liu et al. *On the dependency of cellular protein levels on mRNA abundance.* Cell 2016.](https://doi.org/10.1016/j.cell.2016.03.014)
