## Pipeline for Analysis of Proteomics data

#### Data and R pipeline provided by Hossein Fazelinia, PhD 

In today's assignment, we will analyze data from a Data-Independent Acquisition Mass Spectrometry (DIA-MS) experiment. In this experiment, heart tissue was collected from wild-type and mutant mice to evaluate the effect of a specific metabolic enzyme on mouse development. 

As you will see, we will follow steps that are very similar to the RNAseq analysis that you learned earlier in the course, with some additional steps that address challenges specific to proteomics data, such as missing values and data imputation.

First, we will need to install a couple of R packages locally to your account. For that, do the following:

   - Open a Terminal and at the command prompt, type "R" followed by return (i.e., open R on the command line).
   - Once R has spawned, enter the following command:
    
    install.packages('ComplexUpset')
    
    
   - When prompted by the below, type "yes" followed by return: 
    
    Would you like to use a personal library instead? (yes/No/cancel)
    
        
   - When prompted by the below statement, type "yes" followed by return:
    
    Would you like to create a personal library ‘/home/user/R/x86_64-pc-linux-gnu-library/4.2’?
    
    
    
   - This will proceed to install the `ComplexUpset` package and will install quickly (<1 minute). You will know this is complete when you see "* DONE (ComplexUpset)" in the printed log (and you will have a fresh command prompt).
   - Next, enter the following command:
    
    BiocManager::install('msImpute')
    
    
   - This will proceed to install the `msImpute` package, along with some additional required packages. This will take some time to install (3-5 minutes). You will know this is complete when you see "* DONE (msImpute)" in the printed log (and you will have a fresh command prompt).
   
   
   - Next, enter the following command:
   
    BiocManager::install('PCAtools')
   
   
   - This will proceed to install the `PCAtools` package, along with some additional required packages. This will install pretty quickly (<1 minute). You will know this is complete when you see "* DONE (PCAtools)" in the printed log (and you will have a fresh command prompt).
   
 
   - Next, enter the following command:
   
    BiocManager::install("clusterProfiler")


   - This will proceed to install the `clusterProfiler` package, along with some additional required packages. This will install reasonably quickly (~1 minute). You will know this is complete when you see "* DONE (clusterProfiler)" in the printed log (and you will have a fresh command prompt).
 
   - Now, enter the following command:
    
    BiocManager::install("org.Mm.eg.db") 


   - This will proceed to install the `org.Mm.eg.db` package, along with some additional required packages. This will install reasonably quickly (<1 minute). You will know this is complete when you see "* DONE (org.Mm.eg.db)" in the printed log (and you will have a fresh command prompt).
 
   - Finally, you will need to restart the R kernel in this Jupyter notebook. To do this, go to the `Kernel` menu and select the option `Restart kernel...`. It will take ~10-20 seconds to reload your notebook.
   
And with that, you should be good to proceed! 
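
Before moving on, it can help to confirm that the installs succeeded. The following is a minimal sketch (not part of the original pipeline) that uses base R's `requireNamespace()` to check each package named above; `required_pkgs`, `installed` and `missing_pkgs` are illustrative variable names.

```r
# Packages installed in the steps above
required_pkgs <- c("ComplexUpset", "msImpute", "PCAtools",
                   "clusterProfiler", "org.Mm.eg.db")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report any package that still needs to be installed
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If a package is reported as missing, repeat the corresponding install command from the list above.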

Similar to previous assignments, we will first load the required libraries and data.

In [None]:
library(tidyverse)
library(SummarizedExperiment)
library(limma)
library(pheatmap)
library(ComplexUpset)
library(msImpute)
library(PCAtools)

We will also load some additional accessory functions to help us with tasks that are specific to the analysis of proteomics data. The functions come from the package DEP (https://rdrr.io/bioc/DEP/).

Run the line below to load those functions.

In [None]:
source("functions.R")

#### Loading the data and basic filtering

**Q1.** You will find two csv files in your working directory: "WP_dDIA_protein_report.csv", containing output from the program Spectronaut with the calculated protein abundances, and "metadata_WP.csv", with the meta information about each sample.

Load the data in both files into two tibbles: 

`raw_data` and `meta_data` 

Use `mutate_at(vars(starts_with('sample_')), as.numeric)` to convert data values in `raw_data` to numeric values.


Then, take a quick look at both data sets using `head()` or another strategy. 

**Q2.** List the unique values in the column `FastaFiles`. First `select()` this column, then use `unique()` to list each value only once.

As you can see, each entry in the `raw_data` table was mapped to a list of verified mouse proteins (`UP000000589_mouse_reviewed_can_220920`), to a list of common contaminants (`MaxQuant Contaminants`), or to a combination of both. Here, we will only analyze proteins that are mapped with high confidence to verified mouse proteins.

Next, we will run through a few steps to clean up the data:
- use `filter()` to keep only proteins that map exclusively to the list of verified mouse proteins
- `select()` columns used for downstream analysis and samples that are specified in `meta_data`
- replace any `NaN` values with `NA` (some programs use `NaN` instead of `NA`)
- remove all proteins with missing values in all samples

We have provided you the code to do this; please execute the cell below:

In [None]:
raw_data_filter <- raw_data %>% 
  filter(FastaFiles=="UP000000589_mouse_reviewed_can_220920") %>%
  dplyr::select(Genes,ProteinGroups,ProteinDescriptions,ProteinNames, which(names(.) %in% meta_data$label)) %>% 
  mutate_all(~ifelse(is.nan(.), NA, .)) %>% 
  filter_at(vars(starts_with("sample")), any_vars(.!= "Na")) 

Let's focus on the last line, where we remove all proteins that have missing values in all samples. `filter_at()` is similar to `filter()`, which you have used extensively before, but it applies the filter to the columns given in its first argument, `vars(starts_with("sample"))`. In the next argument we use `any_vars()`, which combines the results of the boolean expression `.!= "Na"` (an expression that returns `TRUE` or `FALSE`) across those columns with a union (`OR` operator). The comparison returns `TRUE` for any observed value and `NA` for a missing one (comparing `NA` to anything yields `NA`), and rows for which the combined result is not `TRUE` are dropped. So as long as there is at least one value per protein, the row will be retained.
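
To see why the comparison against `"Na"` behaves like a missing-value test, here is a small sketch on a made-up tibble (`toy` and its values are purely illustrative). It also shows a more explicit alternative using `if_any()` and `is.na()`, which assumes a reasonably recent version of dplyr (1.0.4 or later).

```r
library(dplyr)

# Toy data: gene B is missing in every sample
toy <- tibble(gene     = c("A", "B", "C"),
              sample_1 = c(1.2, NA, NA),
              sample_2 = c(NA,  NA, 3.4))

# Comparing NA to anything yields NA, which filter() treats like FALSE
NA != "Na"

# The form used in the pipeline: keep rows with at least one observed value
kept_original <- toy %>%
  filter_at(vars(starts_with("sample")), any_vars(. != "Na"))

# A more explicit way to express the same intent
kept_explicit <- toy %>%
  filter(if_any(starts_with("sample"), ~ !is.na(.x)))

kept_original
kept_explicit
```

Both versions keep the same rows in this toy case; the explicit `is.na()` form simply states the intent more directly.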

**Q3.** The function `all_vars()` is a similar logical operator that uses `AND` instead of `OR`. Describe what would have happened if we had used it instead of `any_vars()`.

The dataset you are using has unique Uniprot identifiers. However, those are not immediately informative. The associated gene names are more useful, yet these are not always unique, mostly due to the presence of protein isoforms. So, next we need to make some adjustments to create unique identifiers. To do this, we have created a custom function, `make_unique()`. (If you are curious what the code does, you can look at `functions.R`.)

Execute the code below to make all identifiers unique:

In [None]:
raw_data_filter <- make_unique(raw_data_filter, "Genes", "ProteinGroups", delim = ";")

#### Generate a SummarizedExperiment object

In order to use standardized tools for expression analysis, we will save our data as a `SummarizedExperiment` (SE) object. If you need a quick reminder, go back to the first RNAseq module. Previously, we constructed this object directly from the output of the read mapping algorithm SALMON. But it turns out that it is also possible to construct this object directly from a data matrix with expression values. To construct an SE object, we will use another custom function, `make_se()`, that takes as arguments a data matrix, a list of data columns and an experimental design.

Review and run the code below to construct a SE object.

In [None]:
data_for_se <- raw_data_filter %>%
    dplyr::select(-c(ProteinGroups, Genes, ProteinDescriptions,ProteinNames)) %>%
    mutate_at(vars(-name, -ID), as.numeric) 
#note that the data was loaded as chr, in the last line we convert it to numerical values

sample_columns_for_se <- grep("sample_", colnames(data_for_se))

#the experimental design is generated directly from the meta data that we loaded at the beginning
experimental_design <- meta_data %>%
  dplyr::select(label,condition,replicate)

#now that we have everything we need, let's generate our SE object
data_se <- make_se(data_for_se, sample_columns_for_se, experimental_design)

experimental_design
head(data_for_se)
data_se

Note that the first column in the experimental design matches the column names for our samples.

We also printed some information about the SE object. The printout also lists functions that we can use to extract data from this object, such as `assays`, `rownames`, etc.
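
For intuition about what an SE object holds, the sketch below builds one directly from a small made-up matrix using the `SummarizedExperiment()` constructor. This is not what `make_se()` does internally (that function lives in `functions.R`); the names `toy_mat`, `toy_design` and `toy_se` and all values are illustrative assumptions.

```r
library(SummarizedExperiment)

# Toy intensity matrix: proteins in rows, samples in columns
toy_mat <- matrix(c(20.1, 19.8,   NA, 21.5,
                    18.2, 18.9, 22.0, 21.7),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("GeneA", "GeneB"),
                                  c("wt_1", "wt_2", "Mutant_1", "Mutant_2")))

# Sample annotation: row names must match the column names of the matrix
toy_design <- DataFrame(condition = c("wt", "wt", "Mutant", "Mutant"),
                        replicate = c(1, 2, 1, 2),
                        row.names = colnames(toy_mat))

# Combine the data matrix and the sample annotation into one object
toy_se <- SummarizedExperiment(assays = list(intensity = toy_mat),
                               colData = toy_design)

assay(toy_se)              # the data matrix
colData(toy_se)$condition  # the sample annotation
```

Keeping the measurements and the sample annotation in one object is what lets downstream functions look up the condition of each column without extra bookkeeping.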

### Analyzing and visualizing missing values

The first step in the analysis of proteomics data, before doing a quantitative analysis, is to simply ask how many proteins were identified in each sample and whether there are proteins that are exclusive to only one sample.

We will first extract the data from our SE object and save it as a `tibble`, so we can use our tidyverse skills to make some plots.

In [None]:
data_tb <- assay(data_se) %>% as_tibble(rownames = 'gene')
head(data_tb)

**Q4.** Write code to plot a histogram of the number of samples each protein is identified in. Here is one suggestion on how to do this:

- Use the `gather()` function to tidy your data such that each row contains a single measurement with the gene name and sample
- Add another column `found`; set the value of the entry to `0` if the value is `NA` and `1` if the value is numerical (Hint: use `mutate()` and `ifelse()`. `ifelse()` is a handy function which can pair with `is.na()`, though you will need to specify the correct arguments in the correct spots!)
- Lastly `group_by()` gene name, summarize using the column `found`, and make a ggplot histogram with `bins = 10`, as there are 10 total samples (5 wildtype and 5 mutant).


It looks like most proteins are identified in almost all samples, that is great!

Still, a useful QC step in the analysis of proteomics data is to visualize the pattern of missing values. We do this mainly to test two things:
- If a specific sample has a large excess of missing values, then we may want to remove that sample.
- Whether the data is missing at random (MAR) or missing not at random (MNAR). 

In MNAR, missing values will be mostly observed at low expression values, suggesting that these are proteins that are below the detection limit in specific samples. Proteomics data can be MNAR, MAR, or a combination of both. (The pattern of missing values may impact interpretation or the choice of algorithms that you might use to _impute_ the missing data.)
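
The difference between the two patterns can be illustrated with simulated data. The sketch below is a toy simulation, not the assignment data: it draws made-up log-intensities and then marks 20% of them as missing either at random or based on low intensity. All names and numbers are illustrative assumptions.

```r
set.seed(1)

# Simulated log-intensities for 1000 proteins in one sample
intensity <- rnorm(1000, mean = 20, sd = 2)

# MAR: each value has the same 20% chance of being missing
mar_missing <- runif(1000) < 0.2

# MNAR: the lowest 20% of intensities fall below the detection limit
mnar_missing <- intensity < quantile(intensity, 0.2)

# Average (true) intensity of the values that went missing
mean(intensity[mar_missing])   # close to the overall mean
mean(intensity[mnar_missing])  # clearly below the overall mean
```

In real data the true intensity of a missing value is unknown, which is why we look at the pattern of missing values across proteins sorted by their observed intensity instead.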


**Q5.** To visualize missing values we will use the function `pheatmap()` (i.e., "Pretty" heatmap). Use tidyverse with `data_tb` as the input to generate a new tibble `data_tb_plot` that we will use to plot the heatmap:

- Add another column `total` with the function `rowSums()` (make sure you specify `na.rm=TRUE`! Otherwise, every row with `NA` will sum to `NA`)
- `arrange` `data_tb_plot` in descending order using the column `total`
- remove the column `total` and the column `gene`
- report the `head()` and `dim()` of `data_tb_plot`

Use the code below to plot the heatmap:

In [None]:
missing_values_heatmap <- pheatmap(as.data.frame(data_tb_plot),
                                   cluster_rows = FALSE, 
                                   cluster_cols = FALSE, 
                                   show_rownames = FALSE, 
                                   legend= TRUE,
                                   na_col = "black",
                                   main = "Missing values pattern for sorted proteins (based on intensity)")

**Q6.** What do you think? Does the data look MAR or MNAR?

We will further filter the data to keep only proteins that are represented in at least 4 of the 5 replicates (i.e., at most one missing value) for at least one condition (wt or mutant). We will do this using the custom function `filter_missval()`. The argument `thr` sets the threshold for the allowed number of missing values in at least one condition.

Run the code below:

In [None]:
data_se_filt <- filter_missval(data_se, thr = 1)
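
The thresholding rule described above (keep a protein if at least one condition has no more than `thr` missing values) can be sketched in base R on a made-up matrix. This is an illustration of the rule only, not the implementation of `filter_missval()` in `functions.R`; `toy`, `condition`, `n_missing` and `keep` are hypothetical names.

```r
# Toy matrix: 3 proteins, 2 conditions with 3 replicates each
toy <- matrix(c(20, 21, 19, 22, 21, 20,
                NA, NA, 18, 19, 20, 21,
                NA, NA, 17, NA, 18, NA),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("P1", "P2", "P3"),
                              c("wt_1", "wt_2", "wt_3",
                                "Mutant_1", "Mutant_2", "Mutant_3")))
condition <- c("wt", "wt", "wt", "Mutant", "Mutant", "Mutant")
thr <- 1

# Count missing values per protein within each condition
n_missing <- sapply(split(seq_along(condition), condition),
                    function(cols) rowSums(is.na(toy[, cols, drop = FALSE])))

# Keep a protein if at least one condition has <= thr missing values
keep <- rowSums(n_missing <= thr) >= 1
toy_filtered <- toy[keep, , drop = FALSE]

n_missing
toy_filtered
```

Here P2 is kept because it is complete in the Mutant group even though it is mostly missing in wt, while P3 is dropped because it has two missing values in both groups.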

### Using UpSet plots to identify proteins that are exclusive to only one group

Next, we will use an UpSet plot to visualize the intersection between the two conditions in the identified proteins. UpSet plots are an advanced version of Venn diagrams that can help with complex intersections (between multiple groups). While here we have a relatively simple comparison (only two groups), we want to get you familiar with this graphical representation. You can read more about these plots here: https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html

The input to generate an UpSet plot can be a list of sets or a binary association matrix (which we will use here).

Our matrix will have proteins as rows and groups (conditions) as columns. The values in the matrix will be `1` if the protein was identified in at least one sample of the given group, `0` otherwise.
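
To make the two input formats concrete, the sketch below converts a made-up list of sets into the equivalent binary association matrix using base R. The set contents and the names `sets`, `all_items` and `membership` are illustrative assumptions.

```r
# Toy input as a list of sets: which proteins were seen in each group
sets <- list(Mutant = c("A", "B", "C"),
             wt     = c("B", "C", "D"))

# All proteins that appear in at least one set
all_items <- sort(unique(unlist(sets)))

# One column per set: 1 if the protein belongs to the set, 0 otherwise
membership <- sapply(sets, function(s) as.integer(all_items %in% s))
rownames(membership) <- all_items

membership
```

Each row of `membership` describes one protein, and each distinct row pattern corresponds to one intersection shown in an UpSet plot.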

There are many ways to generate this matrix from our filtered SE using tidyverse. We provide code to do this below. To understand this code, it is best to run it line by line, piping the result into `head()` after each step, and to add the assignment to a new variable only once you have a good idea of how the code works.

In [None]:
assoc_mat <- assay(data_se_filt) %>% 
    as_tibble(rownames = 'gene') %>%
    gather(sample, exp, -gene) %>% 
    separate(sample, c("group", "Replicate")) %>% #we separate sample into group and replicate id
    group_by(gene, group) %>% 
    summarize(Mean = mean(exp, na.rm=TRUE)) %>%
    ungroup() %>% 
    mutate(group = factor(group)) %>%
    pivot_wider(names_from = group, values_from = Mean) %>% #here we go from the tidy dataset to a matrix again
    mutate(across(where(is.numeric), ~replace_na(.,0))) %>%   
    mutate(across(where(is.numeric), ~replace(., .>0,1)))

Now use the code below to generate the plot:

In [None]:
# use only the group indicator columns (everything except `gene`) to define the sets
upset(assoc_mat, setdiff(colnames(assoc_mat), "gene"), min_size=1,
      themes=upset_modify_themes(list('Intersection size'=theme(
        axis.text=element_text(size=12, face='bold'),
        axis.title=element_text(size=10, face='italic')))))

**Q7.** Now write code that will use `assoc_mat` to list the proteins that are exclusive to `Mutant` and those that are exclusive to `wt`. 

### Normalization, data imputation and PCA

#### Normalization
As with RNAseq data, in genome-scale proteomics experiments we assume that the majority of proteins do not change in abundance (across samples in the same project), and therefore samples should have very similar intensity distributions. 

**Q8.** Complete the code below to produce a violin plot to compare the intensity distribution across samples (you have done this many times before, most recently in the exam!)

In [None]:
assay(data_se_filt) %>%
    as_tibble(rownames = 'gene') %>%
    
    

The data looks quite good, but to make sure that it is indeed normalized we will apply a simple median-centering normalization. Run the code below to produce a normalized SE.

In [None]:
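data_se_filt_norm <- data_se_filt
# per-sample medians; subtracting them (shifted by their mean to stay on the original scale) aligns all sample medians
sample_medians <- apply(assay(data_se_filt_norm), 2, median, na.rm = TRUE)
assay(data_se_filt_norm) <- scale(assay(data_se_filt_norm), center = sample_medians - mean(sample_medians), scale = FALSE)

Note that `center` takes one value per sample (column), which is subtracted from that sample, and that we use `scale=FALSE`. Using `scale=TRUE` would also standardize the data by dividing each sample by its standard deviation.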
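
Median centering can be sketched on a small made-up matrix to show its effect: after the shift, every sample has the same median. The values and the names `toy`, `med` and `centered` are illustrative assumptions.

```r
# Toy log-intensities: proteins in rows, samples in columns
toy <- cbind(s1 = c(10, 12, 14),
             s2 = c(11, 13, 15),
             s3 = c( 9, NA, 13))

# Median of each sample, ignoring missing values
med <- apply(toy, 2, median, na.rm = TRUE)

# Subtract each sample's offset from the average median;
# scale = FALSE means the spread of each sample is left untouched
centered <- scale(toy, center = med - mean(med), scale = FALSE)

med
apply(centered, 2, median, na.rm = TRUE)
```

Only the location of each sample changes; differences between proteins within a sample are preserved.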

#### Data imputation

There are many approaches to handling missing data. The simplest would be to remove proteins with missing values. This is not recommended, as the missing values could be due to low protein levels in specific samples. Another approach would be to replace missing values with very low values in the case that the data is MNAR. 

There are also many algorithms that try to model the data to computationally impute missing values, a recent comparison of these approaches can be found here: https://www.nature.com/articles/s41598-021-81279-4
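
As an illustration of the "replace with very low values" idea for MNAR data, the sketch below fills missing values with random draws from a narrow distribution shifted below the observed intensities. This is a generic strategy shown for intuition only; it is not what `msImpute` does, and the `shift` and `width` values are assumptions chosen for the example.

```r
set.seed(42)

# Toy log-intensities for one sample, with two missing values
log_intensity <- c(22.1, 20.4, NA, 19.8, 21.3, NA, 23.0, 20.9)

obs_mean <- mean(log_intensity, na.rm = TRUE)
obs_sd   <- sd(log_intensity, na.rm = TRUE)

# Assumed settings: centre the draws 1.8 SD below the observed mean,
# with a spread of 0.3 SD
shift <- 1.8
width <- 0.3

is_missing <- is.na(log_intensity)
imputed <- log_intensity
imputed[is_missing] <- rnorm(sum(is_missing),
                             mean = obs_mean - shift * obs_sd,
                             sd   = width * obs_sd)

imputed
```

Observed values are left unchanged, and the imputed ones sit at the low end of the distribution, mimicking proteins near the detection limit.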

In our case, the task is relatively simple: our data set does not have a lot of missing values, and those that are missing are mostly MNAR.

We will use the package `msImpute`, as it can deal with both MAR and MNAR data and works quite well with a reasonable run time. Run the code below to replace the data in your SE object with imputed data.

In [None]:
y <- assay(data_se_filt_norm)
colnames(y)
# group labels must follow the column order printed above (5 Mutant samples followed by 5 wt samples)
group <- factor(rep(c('Mutant', 'Wt'), each = 5))
y_imp <- msImpute(y, method = "v2-mnar", group = group)

data_se_filt_norm_imp <- data_se_filt_norm
assay(data_se_filt_norm_imp) <- y_imp

#### PCA

The last step is to make sure that no replicate samples are outliers. For that, we will produce a PCA plot to visualize the samples in a 2D plot. Here, we will use the package `PCAtools`, which produces plots very similar to the previous PCA plots we made in the course.

For this analysis we will need to produce two objects: a matrix with the data and the meta data object with information about each sample. The column names of the data need to match the row names of the meta data object!
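
The name-matching requirement can be checked before running the PCA. The sketch below uses made-up data and base R's `prcomp()` as a stand-in for `PCAtools`, so it only illustrates the idea; `toy_data`, `toy_meta` and the sample names are assumptions.

```r
set.seed(1)

# Toy data matrix: 10 proteins (rows) by 4 samples (columns)
sample_names <- c("wt_1", "wt_2", "Mutant_1", "Mutant_2")
toy_data <- matrix(rnorm(40, mean = 20), nrow = 10,
                   dimnames = list(paste0("protein_", 1:10), sample_names))

# Toy metadata: one row per sample, row names equal to the sample names
toy_meta <- data.frame(condition = c("wt", "wt", "Mutant", "Mutant"),
                       replicate = c(1, 2, 1, 2),
                       row.names = sample_names)

# The check: same names, in the same order
names_match <- identical(colnames(toy_data), rownames(toy_meta))
names_match

# PCA on samples: prcomp() expects samples in rows, hence the transpose
toy_pca <- prcomp(t(toy_data))
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 2)
```

If `names_match` is `FALSE`, reorder or rename the metadata rows before running the PCA.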

Review and run the code below. This will give you an additional option for running a PCA analysis that does not require DESeq.

In [None]:
data_for_pca <- assay(data_se_filt_norm_imp)

meta_data_pca <- meta_data %>%
    mutate(condition_rep = paste(condition, replicate, sep = '_')) %>%
    as.data.frame() %>%
    column_to_rownames('condition_rep') %>%
    dplyr::select(replicate, condition) 


p <- pca(data_for_pca, metadata = meta_data_pca) 

biplot(p, labSize = 3, pointSize = 5, colby = "condition",
       legendPosition = "bottom", legendLabSize = 8, legendTitleSize = 10, axisLabSize = 10)

The samples nicely separate between WT and Mutant -- the first principal component (PC1) separates WT from mutants. 


But what about PC2? 

There seems to be additional variation that separates two sets of samples: {Mutant 1/3/4 + WT 2/1} vs. {Mutant 2/5 + WT 4/5/3}. Could this reflect a batch effect arising from how the samples were processed? More on this at the end of the course, so stay tuned! For now we will continue with the data as it is.

# Homework

### Differential abundance analysis using Limma

A popular application of proteomics, similar to RNAseq, is to identify proteins that show differential abundance between conditions. Previously we used `DESeq2` to analyze RNAseq data: this package was written for RNAseq and expects count data (i.e., read counts). 

In the case of proteomics data, however, we have a continuous measure of abundance. While we could simply round our proteomics data to produce pseudo-count data and use DESeq2, this is quite ad hoc. Instead, let us perform an alternative analysis using the `limma` package.

Limma stands for "Linear Models for Microarray Data" and (as the name implies) was initially developed for the analysis of microarray data. Later, it was developed into a Bioconductor package that can handle multiple data types. You can read about the package in this NAR paper: https://academic.oup.com/nar/article/43/7/e47/2414268

Limma fits a gene-wise linear model based on the user-specified experimental design (e.g., stratifying treated/untreated, etc.). It uses an empirical Bayes statistical model to "borrow" information between genes, which is used to calculate a posterior variance estimate for each gene. The algorithm has additional features, such as sample weights that can put less emphasis on low-quality samples, variance modeling, and pre-processing functions that can be used to reduce the noise in the data.
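
The "borrowing" of information between genes can be illustrated with limma's `squeezeVar()` function, which shrinks per-gene sample variances toward a common prior. The sketch below uses simulated variances, so the numbers are purely illustrative; `n_genes`, `df_resid`, `true_var` and `sample_var` are hypothetical names.

```r
library(limma)
set.seed(123)

n_genes  <- 500
df_resid <- 4   # few residual degrees of freedom, as with few replicates

# Simulated true variances that differ between genes
true_var <- 1 / rgamma(n_genes, shape = 5, rate = 5)

# Noisy sample variances estimated from only a few replicates
sample_var <- true_var * rchisq(n_genes, df = df_resid) / df_resid

# Empirical Bayes shrinkage toward a prior estimated from all genes
squeezed <- squeezeVar(sample_var, df = df_resid)

squeezed$df.prior          # how much weight the prior receives
range(sample_var)          # raw variances: very spread out
range(squeezed$var.post)   # posterior variances: pulled toward the prior
```

Extreme variance estimates, which are common when there are only a few replicates, are moderated, and this is what stabilizes the test statistics.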

To start using limma we will generate two variables: a design matrix and a data matrix. Run the code below to generate a design matrix.

Here is a reference to a very good guide to creating design matrices for limma: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7873980/


In [None]:
#this is the sample information that we would like to include in our analysis
group <- factor(meta_data$condition)

#construct the design matrix
design <- model.matrix(~0 + group)

#adjust the column names of the design matrix
colnames(design) <- levels(group)
design

**Q9.** Take a look at the design matrix. What do the entries in the design matrix mean?

#### Running limma
The code below will perform a full limma DE analysis. Go through the code line by line to get an idea of the different steps.

In [None]:
#generate the data in a matrix form for limma analysis
data_limma <- assay(data_se_filt_norm_imp) %>% data.frame()

#calculate a linear model for each gene using the data and design matrix
fit <- lmFit(data_limma, design = design)

# construct a contrast matrix; this tells the algorithm which comparison to make, here we compare Mutant to wt
contrast_matrix <- makeContrasts("Mutant_vs_WT" = Mutant - wt, levels = design)

# Fit the model according to the contrasts matrix
contrasts_fit <- contrasts.fit(fit, contrast_matrix)  

# Compute moderated statistics for the contrast using empirical Bayes
contrasts_fit <- eBayes(contrasts_fit)

# Here we finally calculate differential abundance and produce the results table
results_table <- topTable(contrasts_fit, number = nrow(data_limma))

**Q10.** Generate a volcano plot of the results. How many proteins are differentially up- and down-regulated at an adjusted p-value of 0.001?

### Enrichment analysis:

Now that we have a list of differentially abundant proteins, we would like to test whether these genes are enriched for specific biological functions or pathways, to help shed light on the biological processes that are disrupted in our mutant mice. Similar to RNA-seq, this will be done by looking for enriched gene annotations within the list of differentially abundant proteins.

Previously, we have done this using online tools. However, this analysis can also be done directly in your R coding environment. Several popular packages exist, including `GSEABase`, `gprofiler2` and `clusterProfiler`, which we will use here. `clusterProfiler` performs a Gene Ontology enrichment analysis comparing a list of differential genes to a background list of genes. We encourage you to try the other packages as well, either within this course or in your future work.
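
Over-representation tests of this kind are commonly based on the hypergeometric distribution: given a background set, how surprising is the overlap between the gene list and the genes annotated to a term? The sketch below works through one term with made-up counts in base R; it illustrates the general idea rather than the exact computation performed by any particular package.

```r
# Made-up counts for a single annotation term
N <- 1000  # genes in the background set
K <- 50    # background genes annotated to the term
n <- 40    # genes in our list of interest
k <- 8     # genes in our list that are annotated to the term

# Expected overlap if the list were a random draw from the background
n * K / N

# Probability of seeing k or more annotated genes by chance
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test
contingency <- matrix(c(k, K - k, n - k, N - K - n + k), nrow = 2)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

p_hyper
p_fisher
```

The choice of background matters: using all genes in the genome instead of only the proteins detected in the experiment would change `N` and `K`, and with them the p-value.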

Let's first load the required packages:

In [None]:
library(clusterProfiler)
library(org.Mm.eg.db)
library(ggnewscale)

**Q11.** Generate two variables to start your enrichment analysis: 
- `gene_list_down` - list of gene names that have `adj.P.Val < 0.05` and `logFC < -0.5` (you might need to convert `results_table` to a tibble, filter it and take only the gene name column)
- `background_set` - list of all genes identified in the proteomics data (make sure this list is unique!)

(Tip - you can use the function `pull` to extract a single column from a tibble)

Run the code below to generate the object `ego` containing your enrichment analysis results.

In [None]:
ego <- enrichGO(gene          = gene_list_down,
                universe      = background_set,
                OrgDb         = org.Mm.eg.db,
                ont           = "ALL",
                keyType       = "SYMBOL",
                pAdjustMethod = "fdr",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)
ego


Above you can see a summary of the results. We can also visualize the results of an enrichment analysis graphically. Run the code below to generate a plot summarizing the results.

In [None]:
# store the plot under its own name so that it does not shadow the enrichGO() function
enrichGO_plot <- dotplot(ego, split="ONTOLOGY", font.size = 6, label_format = function(x) stringr::str_wrap(x, width=60)) + facet_grid(ONTOLOGY~., scales="free")
enrichGO_plot

**Q12.** Briefly describe the results in the graph: what does each point represent, what do the axes show, and how do the three panels differ from each other?