# STAT 450: Case Studies in Statistics

### This notebook has two parts:  
- Part I explains Jupyter Notebooks.  
- Part II introduces a case study we will use throughout the course. There are some exercises throughout to help you check your knowledge.  These exercises are not turned in.

## Part I: Jupyter Notebooks

Welcome to Jupyter Notebooks! 

If you have never seen a Jupyter Notebook before, surprise! 

Jupyter Notebook is an awesome tool for data exploration and analysis. 
The secret is that a notebook lets you write text, write code, 
run the code, and see its output, all in the same document. 
Imagine that, in the same document, you can:

1. Explain the problem and the data;
2. Load the data;
3. Make plots;
4. Interpret the plots;
5. Run your model;
6. Interpret your model;
7. Run other models;
8. Compare the different models.

Incredible, isn't it? 

That's not all: Jupyter Notebooks are not language-dependent either (although we will be working exclusively with the R kernel in this course). 
So, it is well worth spending some time getting familiar with them.

Jupyter notebooks are based on cells, where you can input text or code. This is a cell.

This is another cell. 

This is a third cell. 

Some info about cells: 

- There's no limit on how much text you can put in a cell, although you probably don't want a cell to get too long.
- You can create, delete, and move cells around in a notebook. 
- There are two main types of cells: (1) Markdown cells and (2) code cells.

### 1.1 Creating Cells

There are multiple ways to create cells: 

1. you can click on the `+` sign at the top of the notebook: 

<img src="img/jupyter-tutorial-01.png" width=100%>

2. you can press `a` to create a cell above and `b` to create a cell below. (Make sure you are not in edit mode first; the `Escape` key exits edit mode.)

### 1.2 Choosing the type of cells

Note that when you create a new cell, the cell type is shown at the top bar.

<img src="img/jupyter-tutorial-02.png" width=100%>

You can click there to change between **Markdown** and **code** cells.

Alternatively, you can select the cell you want and 

- press `m` to make it a markdown cell
- press `y` to make it a code cell.

#### **Exercise 1.1**

Create two cells below this one. One of the cells must be a Markdown cell, and the other must be a code cell. 

### 1.3 Markdown cells

Markdown cells are the text cells. All the cells before Exercise 1.1 are Markdown cells. 
They are called Markdown because they are formatted using Markdown. 

You can do a lot of text formatting with Markdown very quickly, using only plain text input. The most common Markdown operations are:

- **Italic:** put an _ (underscore) at the start and end of the content to be italicized. 
    - \_this is italic\_ results in _this is italic_.
- **Bold:** put two \* at the start and end of the content
    - \*\*this is bold\*\* results in **this is bold**
- **Headers:** Put a \# at the start of the line you want a header.
    - \# is level 1 header;
    - \#\# is a level 2 header, and so on.


**bullet points**: we can also see how a list is created from this cell. 

- **inline code**: we can easily format words as part of code using \` (backtick) at the start and end.
    - \`function(){\` results in `function(){`

- **code block**: you can use three backticks \`\`\` to create a code block. You even get to specify the language for syntax highlighting. 

\`\`\`r

foo <- function(){

    print("Hello STAT 450")
    
}
foo()

\`\`\`

results in

```r
foo <- function(){
    print("Hello STAT 450")
}
foo()
```

Note that the text formatted as code is not actually *evaluated* as code, since it is in a Markdown cell.

- **table**: we can also quickly create tables with Markdown.

| Student | Course   | Grade |
|---------|----------|-------|
| Keegan  | STAT 450 | C-    |
| Melissa | STAT 450 | A+    |
| Rodolfo | STAT 450 | A+    |
| Chloe   | STAT 450 | A+    |

- **images**: we can also quickly load images using Markdown. Try checking a cell with an image to figure out the syntax.
- **more**: [here](https://www.markdownguide.org/cheat-sheet/) is a handy cheatsheet of these and some other Markdown operations


### 1.4 Code cells

In our course, we will be using R code.

In [1]:
course <- "STAT 450" 
course

In [2]:
(x <- 10 + 15)

### 1.5 Running cells

To run a cell, you can click on the play button on the top of the notebook. 

<img src="img/jupyter-tutorial-03.png" width=100%>

Alternatively, 

- `Shift+Enter` will run the current cell and move the cursor to the next. If there's no next cell, a new one will be created.
- `Ctrl+Enter` (or `Command+Enter` on a Mac) will run the current cell but not move the cursor. 

#### **Exercise 1.2**

Create a code cell below and calculate the result of `7162386` divided by 3.

### 1.6 Restarting and running all cells

Because Jupyter Notebooks allow you to run cells in any order, some bugs might be introduced when you are creating the notebook. 

You should frequently restart your notebook and re-run all the cells to make sure your notebook runs from start to finish. 

<img src="img/jupyter-tutorial-04.png" width=100%>


### 1.7 Navigation Menu

Here you can navigate to find the notebooks (or create new notebooks) in whatever subfolder you want. 
You can open multiple notebooks simultaneously. 

<img src="img/jupyter-tutorial-05.png" width=60%>


## Part II: Case study: Relation between mRNA and protein levels
### The first mRNA-protein notebook contains an introduction:
- the biology
- the questions
- reading in the data
- checking various things in the data
- student exercises at the end to check understanding (not to be handed in)

### Get to know the area (YouTube, wiki)
Background information (take a look after class):
- Background on protein synthesis: watch one or both of https://www.youtube.com/watch?v=oefAI2x2CQM or part 3 of https://youtu.be/NDIJexTT9j0?t=522 (starting at 8:42 min)
- Information on exploratory data analysis: Chapter 4 of The Art of Data Science: https://bookdown.org/rdpeng/artofdatascience/


The picture below illustrates what is known as the Central Dogma of Biology.

<img src="img/prot_gene.png" width=60%>

## The biology

Nearly every cell in our body contains our genetic material: the genes on our DNA. Some genes act as instructions to make molecules called proteins.  

**Transcription**:  Within the cell nucleus, a "protein recipe" is copied from a gene to mRNA (messenger RNA).  This process is called transcription.  

**Translation**:  The mRNA takes the recipe outside of the nucleus, where ribosomes follow the recipe to assemble amino acids into a protein.  This process is called translation.

![](https://upload.wikimedia.org/wikipedia/commons/6/68/Central_Dogma_of_Molecular_Biochemistry_with_Enzymes.jpg)


One might expect high protein levels to go with high mRNA levels.  Despite this expectation of a high correlation between mRNA and protein levels, experimental results have shown very low correlation values.

In 2014, a research group claimed to have found a "predictive model" that can be used to predict protein levels from mRNA levels!

We'll treat the data this group submitted to the journal as if it were "our client's data".


### Measuring RNA and protein 

There are many different ways to quantify mRNA and protein levels. Here is some background information on the measures used in this case study. 

RNA
- RNA-seq (RNA-sequencing) is a technology that provides counts of short transcript fragments, known as **reads**
- Reads are then 'mapped' back to the genome to summarize as gene-level counts
- Complication: not able to count every fragment present, so we have random sampling
  - counts depend on total number of reads for each sample (**depth**)
  - FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is a way to 'normalize' counts so they are comparable across samples with different depth
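
To make the depth normalization concrete, here is a minimal sketch of the usual FPKM formula on made-up numbers. The counts, gene lengths, and object names (`counts`, `gene_length_bp`) are hypothetical and are not taken from the case-study data, whose exact preprocessing is not described here.

```r
# Toy fragment counts: 3 genes (rows) x 2 samples (columns); values are made up.
# Sample s2 was sequenced three times as deeply as s1.
counts <- matrix(c(100, 200, 300,
                   300, 600, 900),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 500)  # hypothetical lengths

# FPKM = fragments * 1e9 / (gene length in bp * total mapped fragments in the sample)
depth <- colSums(counts)                      # total mapped fragments per sample
fpkm <- sweep(counts, 2, depth, "/") * 1e9    # divide each column by its depth
fpkm <- sweep(fpkm, 1, gene_length_bp, "/")   # divide each row by its gene length
fpkm
```

In this toy example the raw counts differ by a factor of three between the two samples, but the FPKM values are identical, which is the sense in which the measure is comparable across depths.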

Protein
- Mass spectrometry is a technology that measures mass-to-charge ratio of particles in a sample
  - different proteins have different mass-to-charge ratios
- Intensity at each ratio gives information about how abundant each protein is
  - iBAQ (intensity Based Absolute Quantitation) is one way to convert intensities to an absolute measure of abundance



### References
- Original article: [Wilhelm et al. *Mass-spectrometry-based draft of the human proteome.* Nature 2014.](https://www.nature.com/articles/nature13319)

- Article finding flaws with the above: [Fortelny et al. *Can we predict protein from mRNA levels?*  Nature 2017.](https://www.nature.com/articles/nature23293)

- Some more background: [Liu et al. *On the dependency of cellular protein levels on mRNA abundance.*
Cell 2016](https://doi.org/10.1016/j.cell.2016.03.014)


## Data
- Two data files:  
  - `proteinUN.csv`  (protein data)
  - `geneUN.csv`  (mRNA data)
- Each file contains information on 6104 genes (the same genes in each file)
- For each gene, we have measurements on 12 tissue types
- Thus, for each gene, we have 12 pairs of measurements:  protein level and mRNA level
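
As a rough illustration of what "pairs of measurements" means, the sketch below lines up the pairs for one gene and computes a per-gene correlation across tissues. It uses small made-up objects (`mrna_toy`, `prot_toy`) rather than the real files, and the choice of Spearman correlation is only an assumption for the example.

```r
# Toy stand-ins for the two data sets: 2 genes x 4 tissues (values are made up).
tissues <- c("t1", "t2", "t3", "t4")
mrna_toy <- data.frame(rbind(g1 = c(1, 2, 3, 4), g2 = c(4, 3, 2, 1)))
prot_toy <- data.frame(rbind(g1 = c(2, 4, NA, 8), g2 = c(1, 2, 3, 4)))
colnames(mrna_toy) <- colnames(prot_toy) <- tissues

# The pairs for one gene: one (mRNA, protein) pair per tissue.
pairs_g1 <- data.frame(tissue = tissues,
                       mrna = unlist(mrna_toy["g1", ]),
                       prot = unlist(prot_toy["g1", ]))
pairs_g1

# Correlation across tissues for each gene, dropping incomplete pairs.
gene_cor <- sapply(rownames(mrna_toy), function(g) {
  cor(unlist(mrna_toy[g, ]), unlist(prot_toy[g, ]),
      method = "spearman", use = "complete.obs")
})
gene_cor
```

This only works because both objects have the same genes in the same row order and the same tissues in the same column order.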

### Client's claim

Using the median ratio of protein to mRNA levels per gene as a proxy for translation rates, our data show that:
>***"it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance"***
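
One plausible reading of the "median ratio" idea can be sketched as follows. This is our interpretation on made-up numbers, not a reproduction of the client's pipeline; the objects `mrna_toy`, `prot_toy`, and `med_ratio` are hypothetical.

```r
# Toy data: 2 genes x 3 tissues (made-up values), genes in rows, tissues in columns.
mrna_toy <- rbind(g1 = c(1, 2, 4), g2 = c(10, 20, 40))
prot_toy <- rbind(g1 = c(3, 6, NA), g2 = c(5, 10, 40))
colnames(mrna_toy) <- colnames(prot_toy) <- c("t1", "t2", "t3")

# Per-gene median of the protein/mRNA ratio across tissues (missing pairs ignored).
ratio <- prot_toy / mrna_toy
med_ratio <- apply(ratio, 1, median, na.rm = TRUE)

# Predicted protein = gene-specific median ratio * measured mRNA in each tissue.
# The vector is recycled down the columns, so row i is multiplied by med_ratio[i].
prot_pred <- med_ratio * mrna_toy
prot_pred
```

Note that in this sketch the ratio for a gene is estimated from the same tissues it is then used to predict, which is worth keeping in mind when judging how accurate such predictions look.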

### Client's Questions

- Is our analysis statistically correct?

- Is there another way to analyze the data? If so, do we get similar results?

### Our approach

What did the client do?  What is this median ratio approach? Can we reproduce their analysis?  

Do we get the same conclusions?
 
Is there another (better) way to do the analysis?

![](img/nature_res.png)

Both plots above come from [Wilhelm et al. (2014)](https://www.nature.com/articles/nature13319): Left is taken from Supplementary Figure 7a, and right is taken from Figure 5a.

## From [The Art of Data Science (by Peng and Matsui)](https://bookdown.org/rdpeng/artofdatascience/)

### [Chapter 4: Exploratory Data Analysis: Checklist](https://bookdown.org/rdpeng/artofdatascience/exploratory-data-analysis-checklist-a-case-study.html)

>1. Formulate your question
>2. Read in your data
>3. Check the packaging
>4. Look at the top and the bottom of your data
>5. Check your “n”s
>6. Validate with at least one external data source
>7. Make a plot
>8. Try the easy solution first
>9. Follow up


### 1. Formulate your question

Can you predict protein levels from mRNA levels?  
More loosely:  are protein levels and mRNA levels related?  (We may need to fine-tune this.)

  
### 2. Read in your data

Load libraries and read the data files `proteinUN.csv` and `geneUN.csv`.


First take a look at the data files directly (in directory `lectures/release/data`).

Next, load any necessary libraries. Here for illustration we'll load a library called `tidyverse` that actually includes several handy libraries within it. You'll learn more about them as we go.

In [3]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Next read in the two CSV data files as `data.frames` and check the contents of both.

In [4]:
mrna <- read.csv("data/geneUN.csv", row.names = 1)
prot <- read.csv("data/proteinUN.csv", row.names = 1)

### 3. Check the packaging

Let's check the dimension of each object. Note that another useful function that can tell us about the packaging is `str` - try it!

In [5]:
dim(mrna)
dim(prot)

Looks like both objects have 6104 genes and 12 tissues. Notice anything else?

Do the column names of `prot`  agree with the column names of `mrna`?

In [6]:
colnames(mrna)
colnames(prot)
colnames(prot) == colnames(mrna)

Let's check rownames:  Do I want to use the same functions to check 6104 rownames?  

In [7]:
# can we do better than this?
rownames(prot) == rownames(mrna)
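
One way to "do better" is to collapse the comparison into a single `TRUE`/`FALSE`. The sketch below uses small made-up data frames (`a`, `b`, `b_shuffled`) so it runs on its own; the same calls should apply to `prot` and `mrna`.

```r
# Toy stand-ins with matching row names (values are made up).
a <- data.frame(x = 1:3, row.names = c("g1", "g2", "g3"))
b <- data.frame(x = 4:6, row.names = c("g1", "g2", "g3"))
b_shuffled <- b[c("g2", "g1", "g3"), , drop = FALSE]

# One TRUE/FALSE instead of one value per row.
all(rownames(a) == rownames(b))               # element-wise comparison, then collapse
identical(rownames(a), rownames(b))           # also FALSE if the lengths differ
identical(rownames(a), rownames(b_shuffled))  # same names, different order

# If something differs, list the positions that disagree.
which(rownames(a) != rownames(b_shuffled))
```

`identical()` is the stricter of the two, since `==` silently recycles the shorter vector when the lengths differ.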

### Note:

`prot[i,j]` = prot level of gene i in tissue j

`mrna[i,j]` = mrna level of gene i in tissue j

### 4. Check top and bottom of files

Here we want to check for any inconsistencies or surprises in reading in the data (e.g. characters appearing mixed with numeric values, incomplete rows at the end)

In [8]:
head(mrna)
head(prot)

,uterus,kidney,testis,pancreas,stomach,prostate,ovary,thyroid.gland,adrenal.gland,salivary.gland,spleen,esophagus
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000003,2.5e-05,4.44e-05,5.79e-05,1.43e-05,2.01e-05,3.04e-05,5.6e-05,3.04e-05,2.16e-05,5.58e-05,9.9e-06,4.76e-05
ENSG00000000419,3.44e-05,3.43e-05,3.8e-05,2.11e-05,2.96e-05,3.4e-05,3.17e-05,4.54e-05,5.01e-05,2.44e-05,2.96e-05,4.28e-05
ENSG00000000457,7.8e-06,6.8e-06,7.6e-06,4.1e-06,7.7e-06,8.6e-06,8.2e-06,8.2e-06,6.6e-06,6.5e-06,7.4e-06,7.2e-06
ENSG00000000971,1.42e-05,1.93e-05,1.89e-05,6.9e-06,4.71e-05,2.42e-05,2.42e-05,3.31e-05,4.53e-05,2.23e-05,1.14e-05,6.76e-05
ENSG00000001036,3.46e-05,5.39e-05,1.9e-05,2.34e-05,4.75e-05,2.98e-05,3.45e-05,4.54e-05,4.73e-05,1.65e-05,2.94e-05,2.05e-05
ENSG00000001084,1.92e-05,2.27e-05,1.26e-05,6.9e-06,2.95e-05,2.39e-05,6.2e-06,2.76e-05,1.87e-05,9.4e-06,3.43e-05,3.03e-05


,uterus,kidney,testis,pancreas,stomach,prostate,ovary,thyroid.gland,adrenal.gland,salivary.gland,spleen,esophagus
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000003,,,,,9.157831e-07,,,,,9.027262e-06,,
ENSG00000000419,9.966484e-06,1.894265e-05,4.257907e-05,1.575567e-06,3.282342e-05,2.119753e-05,3.916784e-05,3.930635e-06,1.0165e-05,4.048574e-05,5.092213e-06,1.607405e-05
ENSG00000000457,,,6.752985e-07,4.704207e-07,3.745549e-06,,,,3.059918e-07,7.894253e-06,,
ENSG00000000971,3.633516e-05,0.0003358335,0.0001848077,0.0002956261,0.0001561922,0.0001848922,0.0002033825,7.749283e-05,0.0001049859,0.0001306766,0.0001150577,0.0005391352
ENSG00000001036,1.681633e-05,9.71197e-07,4.784997e-05,,3.802391e-06,,,,2.015687e-05,1.048615e-06,1.547352e-05,
ENSG00000001084,1.693588e-05,1.967169e-05,1.01259e-05,2.687924e-05,0.0001255965,3.874807e-05,1.274565e-05,1.077153e-05,1.63447e-05,4.268563e-05,0.0001111609,0.0001012831


In [9]:
tail(mrna)
tail(prot)

,uterus,kidney,testis,pancreas,stomach,prostate,ovary,thyroid.gland,adrenal.gland,salivary.gland,spleen,esophagus
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000262814,1.65e-05,3.72e-05,1.49e-05,1.47e-05,2.46e-05,2.18e-05,1.39e-05,1.53e-05,2.06e-05,2.55e-05,2.01e-05,2.44e-05
ENSG00000266964,7.76e-05,1.23e-05,9.9e-06,7.4e-06,1.23e-05,6.24e-05,9.64e-05,3.7e-06,2.24e-05,1.25e-05,1.61e-05,4.7e-05
ENSG00000267673,1.21e-05,1.94e-05,1.09e-05,1.08e-05,7.6e-06,1.16e-05,1.41e-05,2.4e-05,1.65e-05,1.23e-05,1.03e-05,1.41e-05
ENSG00000269190,2.01e-05,3.22e-05,4.8e-06,5.4e-06,2.3e-06,1.52e-05,2.62e-05,1.9e-06,4.8e-06,6.2e-06,3.4e-06,7.8e-06
ENSG00000271303,1.14e-05,1.55e-05,9.8e-06,6.3e-06,1.6e-05,1.28e-05,1.4e-05,1.63e-05,3.37e-05,1.17e-05,8.5e-06,2.93e-05
ENSG00000272325,1.33e-05,1.37e-05,9.5e-06,5.1e-06,9.4e-06,1.1e-05,1.71e-05,2.54e-05,1.2e-05,7.8e-06,1.43e-05,1.35e-05


,uterus,kidney,testis,pancreas,stomach,prostate,ovary,thyroid.gland,adrenal.gland,salivary.gland,spleen,esophagus
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000262814,7.439914e-07,0.0001433128,1.69385e-05,,5.800116e-06,,1.741214e-06,,1.221164e-05,3.169555e-05,,1.759668e-05
ENSG00000266964,,,,,,1.767627e-05,1.813579e-05,,,,,
ENSG00000267673,,1.574266e-05,1.543045e-06,,5.093142e-06,2.489128e-06,1.358377e-05,,2.214073e-05,9.995027e-06,,
ENSG00000269190,1.803568e-06,1.402511e-05,2.034178e-06,1.091202e-06,1.650044e-05,3.441421e-06,5.647713e-06,6.894839e-07,2.460341e-05,3.089573e-05,1.067625e-05,3.703505e-06
ENSG00000271303,4.685853e-06,,2.669675e-06,,,5.633086e-06,6.527088e-06,2.554554e-06,1.873942e-05,,,1.312187e-05
ENSG00000272325,2.001929e-05,6.37647e-06,6.180226e-05,1.40227e-05,5.605276e-05,7.248613e-05,0.0001571054,3.089959e-05,1.282666e-05,3.657967e-05,1.566443e-05,2.574201e-05
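
Looking at the top and bottom only covers a dozen rows. As a complement, the column types can be checked for the whole table at once. The sketch below uses a made-up data frame (`toy`) in which one stray text entry has turned a column into character data; the same call could be tried on `mrna` and `prot`.

```r
# Toy data frame where one stray "n/a" makes the first column non-numeric.
toy <- data.frame(uterus = c("1.2e-05", "n/a", "3.4e-05"),
                  kidney = c(2.1e-05, NA, 4.0e-05),
                  row.names = c("g1", "g2", "g3"),
                  stringsAsFactors = FALSE)

# TRUE for every column that was read in as numbers.
is_num <- sapply(toy, is.numeric)
is_num

# Names of the columns that deserve a closer look.
names(is_num)[!is_num]
```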


### 5. Check your "n"s

Let's make sure that we actually have 6104 *unique* genes and 12 *unique* columns in `mrna`.

In [10]:
length(unique(rownames(mrna)))
length(unique(colnames(mrna)))

Let's do the same thing for `prot`, but change up our coding style to the *tidyverse* way using the *pipe* operator `%>%`:

In [11]:
rownames(prot) %>% unique() %>% length()
colnames(prot) %>% unique() %>% length()

This style tends to be more readable when performing a series of actions on the same object.
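
If a count ever comes out smaller than expected, it helps to see which identifiers repeat. A minimal base R sketch on a made-up vector (`ids`) follows. As a side note, a data frame cannot have duplicated row names, so `read.csv(..., row.names = 1)` would stop with an error if the first column of the file contained repeats.

```r
# Toy identifiers with one repeat (made up for illustration).
ids <- c("g1", "g2", "g2", "g3")

length(unique(ids)) == length(ids)  # FALSE: fewer unique values than entries
anyDuplicated(ids)                  # position of the first duplicate, 0 if none
ids[duplicated(ids)]                # the repeated values themselves
```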

### Exercise 2: Reading in files

#### Exercise 2.1. ##
Read the two files `data/prot.mini.csv`  and `data/mrna.mini.csv`

In [12]:
# Replace ... with code
prot.mini <- read.csv("data/prot.mini.csv", row.names = 1)
mrna.mini <- ...(..., row.names = 1)



#### Exercise 2.2
What happens if we don't use `row.names = 1` when we read in `prot.mini`?

In [None]:
# Replace ... with code
prot2.mini <- ...
row.names(prot2.mini) %>% head()

#### Exercise 2.3 ##
Check the beginning of each data matrix (`prot.mini`, `mrna.mini`).

In [None]:
# Your code here (hint: use the head() function)

#### Exercise 2.4 ##
What are the dimensions of each data matrix? 

In [None]:
# Your code here (hint: use the dim() function)

#### Exercise 2.5 ##
How many genes does each data matrix contain?

In [None]:
# Your code here (hint: use the nrow() function)

#### Exercise 2.6 ##
Do the two data matrices have the same column and row names, in the same order?

In [None]:
# Replace ... with code
all(colnames(...) == colnames(...))
all(rownames(prot.mini) == ... )

# Exploring the original large data files a little bit



## Are there missing values in the data set?  (easy way to look and a fancy way to look)

Easy way - look at the values. Note that the `slice` function pulls out selected rows. 

In [None]:
# Check prot: the easy way - look at some
slice(prot, 1:5)

This becomes the hard way, though, if we have a large dataset!

Fancy way - using `is.na`

How many NAs are in the first  5 rows of prot?

In [None]:
# First, gain an understanding of is.na():
slice(prot, 1:5) %>% is.na()

In [None]:
slice(prot, 1:5) %>% is.na() %>% sum()

How many protein values are missing in all?

In [None]:
prot %>% is.na() %>% sum()
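
A single total does not say where the missing values are. The sketch below breaks the count down by column and by row, using a small made-up data frame (`toy`) in place of the real data.

```r
# Toy data frame with a few missing values (made up for illustration).
toy <- data.frame(t1 = c(1, NA, 3),
                  t2 = c(NA, NA, 6),
                  row.names = c("g1", "g2", "g3"))

colSums(is.na(toy))  # missing values per tissue (column)
rowSums(is.na(toy))  # missing values per gene (row)
mean(is.na(toy))     # overall fraction of missing entries
```

This works because `is.na()` returns a matrix of `TRUE`/`FALSE` values, which R treats as 1 and 0 when summing or averaging.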



### Exercise 3: Counting missing data

#### Exercise 3.1 ##
How many missing mRNA values are there?   (You should get 5509 missing values.) 

In [None]:
### Replace ... with code.
... %>% is.na() %>% sum()

#### Exercise 3.2 ## 
How many mRNA values are missing for uterus tissues? (You should get 432 missing values.)
Hint: use the `select` function to pull out the relevant column from `mrna`

In [None]:
##  Replace ... with code
... %>% is.na() %>% sum()