# STAT 450: Case Studies in Statistics, January 29, 2025

# Client meeting debrief

### Two questions for discussion:

1. What aspect of the meeting was most **successful** for your team?
2. What aspect of the meeting was most **challenging** for your team?


### An important note on workflows

Make sure that any data wrangling steps you complete are carried out in R, starting from reading in the raw data.

**Do not edit the raw data from your client!**

Why? For reproducibility and transparency. Your teammates and your client need to be able to reproduce your results and see every step used to arrive at them. If the raw data are edited by hand, it becomes very difficult for someone else to arrive at the same results, and the editing steps are hidden.
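
A minimal sketch of this kind of workflow is below. The file names, the column names, and the cleaning step are all made up for illustration (the toy "raw" file is written to a temporary directory only so that the example runs on its own); your own project will look different.

```r
library(readr)
library(dplyr)

# Hypothetical raw file standing in for the client's data
# (created here only so the example is self-contained)
raw_path <- file.path(tempdir(), "raw_client_data.csv")
write_csv(tibble(id = 1:4, value = c("3.2", "n/a", "5.1", "4.0")), raw_path)

# 1. Read the raw data exactly as received
raw <- read_csv(raw_path, show_col_types = FALSE)

# 2. Do every cleaning step in code, so each change is visible and repeatable
clean <- raw %>%
    mutate(value = as.numeric(na_if(value, "n/a")))   # recode "n/a" as NA

# 3. Save the result to a NEW file; the raw file is never overwritten
clean_path <- file.path(tempdir(), "clean_client_data.csv")
write_csv(clean, clean_path)
```

Anyone who has the raw file and this script can rerun it and get the same cleaned data.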

---

# Next Project Milestone - Group Proposal

* Written document that:
    * summarizes the project
    * clearly states the objectives/scientific question
    * gives an overview of the data available 
    * outlines the statistical analysis plan from EDA to formal analysis
* Report is internal (not shared with client)
* Planned analyses do *not* have to be constrained by specific client requests
* Detailed instructions [here](https://canvas.ubc.ca/courses/151975/assignments/2019048)

---

#  Case study: Relation between mRNA and protein levels 

Back to our case study...
Our hypothetical clients have been waiting patiently as we have been performing some initial checks on our mRNA and protein expression data. 
Let's get back to some more EDA (exploratory data analysis)!

## Recall: The Problem
- Despite expectations of a __high__ correlation between mRNA and protein levels, many researchers have studied this relationship and experimental results have shown very __low__ correlation values

- In 2014, a research group claimed to find a "predictive model", which can be used to predict protein from mRNA!!  (published in [Nature: Wilhelm et al. (2014)](https://www.nature.com/articles/nature13319))

- We will use data from this publication as if it is "our client's data"


## Client's question

Can you predict protein levels from mRNA levels? Are protein levels and mRNA levels related?

Is our analysis correct?
>*"Using the median ratio of protein to mRNA levels per gene as a proxy for translation rates, our data show that, it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance"*

## Recall: from Chapter 4 of The Art of Data Science (by Peng and Matsui)

### Exploratory Data Analysis: Checklist

1. *Formulate your question* - **Can you predict protein levels from mRNA levels?**
2. *Read in your data*
3. *Check the packaging*
4. *Look at the top and the bottom of your data*
5. *Check your “n”s*
6. Validate with at least one external data source
7. Make a plot
8. Try the easy solution first
9. Follow up

## Our questions that came up last time

1. What is the median ratio of protein to mRNA levels per gene and why is it used as a proxy for translation rates? - *[we'll put this question aside for now...]*

2. Does the data align with our expectations? Do the protein levels broadly seem related to the mRNA levels (last time we only looked at one gene)?

3. How much of our data is missing?
    - Why are there missing values? 
    - Do NAs correspond to levels too low for the technology to detect?
    - does "NA" carry any additional information?

## Load libraries and read in the tidy data

Last time we created a tibble object that joined the mRNA and protein data together in tidy format. We would like to work with that tidy dataset today, but we're not going to go through all that code right now (you can open the previous notebook if you'd like to review the pivot and join steps). The tidy formatted object we created last time has been saved to a new csv file in this working directory for your convenience 😅, so we'll read it into our R session. 

In [None]:
# load libraries
library(tidyverse)

# read in previously saved tidy dataset and take a peek at it
tidy_data <- read_csv("data/tidy_data.csv", show_col_types = FALSE)
tidy_data %>% head()

Recall the plot we made of the mRNA vs protein levels of the gene "ENSG00000000419", coloured by tissue:

In [None]:
tidy_data %>% filter(gene == "ENSG00000000419") %>%
    ggplot(aes(x = mrna, y = protein, color = tissue)) + 
    geom_point() + ggtitle("ENSG00000000419") 

Are there any missing values for gene "ENSG00000000419"? How would you answer that?

## Exploring missing values in the data set 

Last time we saw that there are some missing values in our dataset. Now we will investigate this further. First, we'll ask: **How many missing mRNA values are there?**

Let's answer this!

In [None]:
tidy_data %>% 
    summarize(number_mrna_missing = sum(is.na(mrna)))

<h5 style="color:red; font-weight:bold;"> Exercise 1: </h5>

Next, modify the previous code chunk to answer the following question:
**How many missing protein values are there?** 

In [None]:
#### YOUR CODE HERE


Now that we have an idea of the total number of missing mRNA and protein values, we'll investigate how that will affect our investigation of the relationship between them. More specifically:

**How many gene-tissue combinations have missing values for mRNA and/or protein?**

Equivalently, let's find the number of complete mRNA-protein pairs for each gene.

In [None]:
complete_pairs <- tidy_data  %>% 
    group_by(gene)  %>% 
    summarize(values_available = sum(!is.na(protein) & !is.na(mrna)))

The `complete_pairs` data frame tells you the number of tissues with complete measurements per gene. 
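
To see what this count means, here is a small sketch on made-up toy data (two hypothetical genes, three tissues; the values are invented for illustration). A row counts as a complete pair only when both `mrna` and `protein` are non-missing.

```r
library(dplyr)

# Made-up toy data: 2 genes x 3 tissues
toy <- tibble(
    gene    = rep(c("geneA", "geneB"), each = 3),
    tissue  = rep(c("lung", "liver", "kidney"), times = 2),
    mrna    = c(1.2, NA, 3.1, 0.5, 0.7, NA),
    protein = c(2.0, 2.5, NA, 1.1, 0.9, NA)
)

# Same logic as for `complete_pairs`, applied to the toy data
toy_pairs <- toy %>%
    group_by(gene) %>%
    summarize(values_available = sum(!is.na(protein) & !is.na(mrna)))

toy_pairs
```

In the toy data, `geneA` has one complete pair (lung) and `geneB` has two (lung and liver), even though each gene has three rows.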

<h5 style="color:red; font-weight:bold;"> Exercise 2: </h5>

How many genes have mRNA and protein data for all 12 tissues?  for just 1 tissue?  etc.

In other words, **what is the distribution of the number of complete mRNA-protein pairs?**

Make a table or a bar chart to describe.

In [None]:
# make a table: 

#### YOUR CODE HERE

# make a bar chart - hint: you can use geom_bar()

#### YOUR CODE HERE

## Is "missingness" related to correlations?

Our goal is examining the correlations between mRNA and protein per gene. 

But first: are these correlations affected by the number of missing values? 

The following code chunk joins our `tidy_data` tibble with the tabulation of number of complete pairs (tissues) available per gene in `complete_pairs` to make a new tibble `dat_npair` that contains, for each gene and each tissue, the protein data + mRNA data + number of complete pairs:

In [None]:
dat_npair <- tidy_data %>% 
    full_join(complete_pairs, by = "gene")
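
If it helps to picture what the join does, here is a sketch with made-up toy tibbles (the names `long`, `per_gene`, and `joined` are hypothetical). Joining a one-row-per-gene summary onto long data repeats the per-gene count on every row for that gene.

```r
library(dplyr)

# Made-up long data: one row per gene-tissue combination
long <- tibble(
    gene    = c("geneA", "geneA", "geneB"),
    tissue  = c("lung", "liver", "lung"),
    mrna    = c(1.2, 0.8, 0.5),
    protein = c(2.0, 2.5, 1.1)
)

# Made-up per-gene summary: one row per gene
per_gene <- tibble(
    gene             = c("geneA", "geneB"),
    values_available = c(2L, 1L)
)

# The number of rows stays the same; the count is copied to each tissue row
joined <- long %>%
    full_join(per_gene, by = "gene")

joined
```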

Now, let's calculate correlation values by gene. Since we need at least 3 values for a meaningful correlation value, we'll exclude genes with 2 or fewer complete pairs.

The code chunk below: 
- considers genes that have at least 3 complete mRNA-protein pairs
- calculates Spearman (rank) correlation between mRNA and protein for each of these genes using complete pairs

In [None]:
dat_cor <- dat_npair %>% 
    filter(values_available >= 3) %>%         ## use `filter` to retain genes with 3 or more complete pairs
    group_by(gene, values_available) %>%      ## use `group_by` to group data by gene and values available
    summarize(cor_g = cor(mrna,               ## use `summarize` to compute one correlation per gene
                          protein,
                          use = "pairwise.complete.obs", 
                          method = "spearman"))

Note that we are using Spearman correlation here, which is based on ranks and measures monotonic association, so it does not assume a linear relationship! If we chose Pearson instead, we would need to examine the relevance of potential linear relationships (and consider possible transformations).
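
A quick sketch of the difference, using made-up data (not our mRNA/protein values) where `y` always increases with `x` but far from linearly:

```r
# Made-up data with a monotonic but strongly non-linear relationship
x <- 1:10
y <- exp(x)

# Pearson measures linear association; Spearman works on the ranks
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Because the ranks of `x` and `y` agree perfectly, the Spearman correlation is 1, while the Pearson correlation is noticeably lower.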

> Aside: Many R functions have an argument for how to handle missing values.
>  Look at help in R for `cor`:  use = `"everything"`, `"all.obs"`, `"complete.obs"`, `"na.or.complete"`, or `"pairwise.complete.obs"`
> - help for `cor` is written for a correlation matrix (each entry is a correlation between 2 variables)
> - From help:
>   - If use is `"complete.obs"` then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error). 
>   - `"na.or.complete"` is the same as `"complete.obs"` unless there are no complete cases, then `"na.or.complete"` gives NA.
>   - `"pairwise.complete.obs"` computes the correlation between each pair of variables using all complete pairs of observations on those variables.  (Same as `"complete.obs"` if you only have two variables, e.g. mRNA and protein.)
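
As a small sketch of these options, the example below uses two made-up vectors with missing values (the numbers are invented; only the `use` argument matters here).

```r
# Made-up vectors with missing values (for illustration only)
mrna    <- c(1.0, 2.0, NA, 4.0, 5.0)
protein <- c(2.1, NA, 3.3, 3.9, 5.2)

# Default use = "everything": any NA in the inputs gives NA
cor_default <- cor(mrna, protein, method = "spearman")

# Drop incomplete pairs first; with only two variables these options agree
cor_complete <- cor(mrna, protein, use = "complete.obs", method = "spearman")
cor_pairwise <- cor(mrna, protein, use = "pairwise.complete.obs", method = "spearman")

c(default = cor_default, complete = cor_complete, pairwise = cor_pairwise)
```

Only three of the five pairs are complete here, so the last two correlations are computed from just those three pairs.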

Let's look at the correlations in `dat_cor`.

In [None]:
head(dat_cor)
tail(dat_cor)

Now we have a new data frame that contains, for each gene, the correlation between mRNA and protein and the number of tissues with complete pairs.

<h5 style="color:red; font-weight:bold;"> Exercise 3: </h5>

Make a plot to examine the distributions of the correlations per gene. What information does your plot tell you? 

In [None]:
#### YOUR CODE HERE


<h5 style="color:red; font-weight:bold;"> Exercise 4: </h5>

Use boxplots to examine the relationship between number of available pairs and the correlations per gene. 

In [None]:
#### YOUR CODE HERE
# hint: you may want to convert `values_available` to a factor


### Questions for discussion  (compare the boxplots)
-  Compare the correlations for genes with 3 complete pairs, 4 complete pairs, etc.  What do you see?
-  Why do you think you see this?  Is it some problem, e.g. bias?
-  Why don't we have a boxplot for 2 complete pairs?  1 complete pair?  no complete pairs?

## Is "missingness" related to expression levels?

This may give us a clue as to whether the missing values actually represent something other than 'missing at random' (e.g. does missing actually mean the technology couldn't detect anything, so we should think of it as a 'zero'?)
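
To make the two ideas concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the simulated values, the 30% missingness, and the hypothetical detection limit have nothing to do with the client's data.

```r
set.seed(450)  # arbitrary seed so the example is repeatable

# Simulated "true" abundances (made-up distribution)
true_values <- rnorm(1000, mean = 5, sd = 1)

# Mechanism 1: values go missing completely at random (300 of 1000 dropped)
mcar <- true_values
mcar[sample(length(mcar), 300)] <- NA

# Mechanism 2: values below a hypothetical detection limit are reported as NA
detection_limit <- quantile(true_values, 0.3)
censored <- ifelse(true_values < detection_limit, NA, true_values)

# Compare the mean of the observed (non-missing) values under each mechanism
c(true     = mean(true_values),
  mcar     = mean(mcar, na.rm = TRUE),
  censored = mean(censored, na.rm = TRUE))
```

Under the second mechanism, an `NA` is informative: it tells us the value was low, so the mean of the observed values overstates the typical level.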

Let's investigate whether the amount of missing protein values is related to the mean protein level.

In [None]:
summaries_prot <- 
    tidy_data %>%  
    group_by(gene) %>%                                  # group by gene
    summarize(mean_prot = mean(protein, na.rm = TRUE),  # calculate mean of all non-missing protein values per gene
              available_prot = sum(!is.na(protein)))    # count the non-missing protein values per gene

summaries_prot %>% head()

Take a quick look at the distribution of number of non-missing protein values.

In [None]:
table(summaries_prot$available_prot)

Let's remove the genes that have all missing protein values.

In [None]:
summaries_prot <- summaries_prot %>%
  filter(available_prot > 0)

<h5 style="color:red; font-weight:bold;"> Exercise 5: </h5>
Use a plot of your choice to illustrate the relationship between mean protein abundance per gene and the number of observed protein values available. 


In [None]:
### YOUR CODE HERE


**Questions for Discussion**:
- What do we see in the plot?
- If the data were missing completely at random: what would we expect? 
- We know that values are sometimes missing because they are too small: they fall below a threshold of detectability. If that is the case with the protein measurements, what would you expect to see in your plot? Does that explain what you see?

## Summaries and statistics are helpful to form or shape certain expectations about the data

### Summary: EDA 
EDA helps us: 
- look at the data in different ways 
- understand the problem better 
- examine the effect of missing data 
- understand the format of the data so we can manipulate it and change the format if needed 
- create new questions (e.g. are you interested in analyzing correlations per tissue instead of per gene?)
- see that correlations are highly variable for genes with few measured tissues 
- see that, on average, correlations per gene are below 0.5 

---

## Looking forward: Model expectations

A data analyst can construct *a model* to answer these questions. This process depends on the analyst's expectations (about how the world works and how the data were generated).

Notes from The Art of Data Science, by Peng and Matsui

>**"A data analyst creates, assesses, and refines a model, [...] using the data, to understand the real world"**


Statistics and statistical models serve two key purposes:
1. to provide a quantitative summary of your data, and 
2. to impose a specific structure on the population from which the data were sampled. 
    - It’s sometimes helpful to understand what a model is and why it can be useful through the illustration of extreme examples. The trivial “model” is simply no model at all.
    - Having all the data is important, but on its own it is often not very useful.


### 1. The trivial model: no model

Our client has collected data with mRNA and protein measurements across many genes and 12 tissues. However, *the raw data set(s) do not provide any summary or sense of uncertainty.* 
>**"The trivial model provides *no reduction of the data*"**

### 2. Everything beyond no-model: data reduction
Usually, we start by reducing our data to simple useful summaries (or statistics) that help us understand our data better.

Common examples of such statistics are: the sample mean, the median, the standard deviation, the maximum, etc.
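
As a quick sketch, each of these statistics reduces a whole vector to a single number. The measurements below are made up for illustration, with one missing value included to show `na.rm = TRUE`.

```r
# Made-up measurements, including one missing value
x <- c(4.2, 5.1, NA, 3.8, 6.0, 5.5)

# Each statistic summarizes the non-missing values with one number
summary_stats <- c(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
)

summary_stats
```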

The beauty of these statistics is that:

**Brainstorm:** 

- (your answer goes here)
- 
- 

Ultimately, our goal is to translate a scientific question into something we can objectively answer using a statistical model (e.g. compare a useful summary of our data to the value we would expect if our hypothesis were not true). 