vignettes/CEMiTool.Rmd

---
title: 'CEMiTool: Co-expression Modules Identification Tool'
output:
  html_document:
    toc: yes
  pdf_document:
    toc: yes
  prettydoc::html_pretty:
    highlight: github
    theme: cayman
    toc: yes
vignette: > 
  %\VignetteIndexEntry{CEMiTool: Co-expression Modules Identification Tool}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---


```{r style, echo=FALSE, results="asis", message=FALSE}
knitr::opts_chunk$set(tidy = FALSE,
                      warning = FALSE,
                      message = FALSE,
                      cache=TRUE)
```

# Basic usage

The `CEMiTool` `R` package provides users with an easy-to-use method to automatically
run gene co-expression analyses. In addition, it performs gene set enrichment analysis and over representation analysis for the gene modules returned by the analysis.

For the most basic usage of `CEMiTool` only a `data.frame` containing expression data
with gene symbols in the rows and sample names in the columns is needed, as following:

```{r}
BiocManager::install("CEMiTool")
library("CEMiTool")
# load expression data
data(expr0)
head(expr0[,1:4])
```

In this usage, the `cemitool` function receives the expression data, performs the co-expression modules analysis and returns a `CEMiTool` object:
```{r, results='hide'}
cem <- cemitool(expr0)
```
To see a summary of the slots inside the `CEMiTool`, just call `cem`
```{r}
cem
```

The `cemitool()` function automatically executes some common analyses, depending on the input data. The following sections describes how to perform each of these analyses separately. Details on how to perform all analyses together are at the end of this vignette.

## Gene filtering
As a default, the `cemitool` function first performs a filtering of the gene expression data before running the remaining analyses. This filtering is done in accordance to gene variance. In this example the filtering step has reduced the gene number to `r length(cem@selected_genes)`. 

## Module inspection
The module analysis has produced `r CEMiTool::nmodules(cem)` modules and the allocation of genes to modules can be seen with the `module_genes` function:

```{r}
# inspect modules
nmodules(cem)
head(module_genes(cem))
```

Genes that are allocated to `Not.Correlated` are genes that are not clustered into any module.

If you wish to adjust the module definition parameters of your `CEMiTool` object, use `find_modules(cem)`.

You can use the `get_hubs` function to identify the top `n` genes with the highest connectivity in each module: `hubs <- get_hubs(cem,n)`. 
A summary statistic of the expression data within each module (either the module mean or eigengene) can be obtained using: `summary <- mod_summary(cem)`
 
## Generating reports

The information generated by CEMiTool, including tables and images can be accessed by generating a report of the `CEMiTool` object:
```{r, eval=FALSE}
generate_report(cem)
```

Also, you can create tables with the analyses results using: 
```{r}
write_files(cem)
```

Plots containing analysis results can be saved using:
```{r}
save_plots(cem, "all")
```


# Adding sample annotation

More information can be included in CEMiTool to build a more complete object and generate richer reports about the expression data. Sample annotation can be supplied in a `data.frame` that specifies a class for each sample. Classes can represent different conditions, phenotypes, cell lines, time points, etc. In this example, classes are defined as the time point at which the samples were collected.

```{r}
# load your sample annotation data
data(sample_annot)
head(sample_annot)
```

Now you can construct a `CEMiTool` object with both expression data and sample annotation:

```{r, results='hide'}
# run cemitool with sample annotation
cem <- cemitool(expr0, sample_annot)
```

The sample annotation of your CEMiTool object can be retrieved and reassigned using the `sample_annotation(cem)` function. This function can also be used to define the columns with sample names and sample groupings (which are "SampleName" and "Class", by default):

```{r}
sample_annotation(cem, 
                  sample_name_column="SampleName", 
                  class_column="Class") <- sample_annot
```

## Module enrichment

When sample annotation is provided, the `cemitool` function will automatically evaluate how the modules are up or down regulated between classes. This is performed using the gene set enrichment analysis function from the `fgsea` package. 

You can generate a plot of how the enrichment of the modules varies across classes with the `plot_gsea` function. The size and intensity of the circles in the figure correspond to the Normalised Enrichment Score (NES), which is the enrichment score for a module in each class normalised by the number of genes in the module. This analysis is automatically run by the `cemitool` function, but it can be independently run with the function `mod_gsea(cem)`.

```{r}
# generate heatmap of gene set enrichment analysis
cem <- mod_gsea(cem)
cem <- plot_gsea(cem)
show_plot(cem, "gsea")
```

## Expression patterns in modules

You can generate a plot that displays the expression of each gene within a module using the `plot_profile` function:

```{r}
# plot gene expression within each module
cem <- plot_profile(cem)
plots <- show_plot(cem, "profile")
plots[1]
```

# Adding ORA analysis

CEMiTool can determine which biological functions are associated with the modules by performing an over representation analysis (ORA). To do this you must provide a pathway list in the form of GMT file. CEMiTool will then analyze how these pathways are represented in the modules.

You can read in a pathway list formatted as a GMT file using the read_gmt function. This example uses a GMT file that comes as part of the `CEMiTool` example data:

```{r}
# read GMT file
gmt_fname <- system.file("extdata", "pathways.gmt", package = "CEMiTool")
gmt_in <- read_gmt(gmt_fname)
```

You can then perform ORA analysis on the modules in your `CEMiTool` object with the `mod_ora` function:

```{r}
# perform over representation analysis
cem <- mod_ora(cem, gmt_in)
```

The numerical results of the analysis can be accessed with the `ora_data` function. In order to visualise this, use `plot_ora` to add ORA plots to your `CEMiTool` object. The plots can be accessed with the `show_plot` function.

```{r}
# plot ora results
cem <- plot_ora(cem)
plots <- show_plot(cem, "ora")
plots[1]
```

# Adding interactions

Interaction data, such as protein-protein interactions can be added to the `CEMiTool` object in order to generate annotated module graphs. Interaction files are formatted as a `data.frame` or `matrix` containing two columns for interacting pairs of genes.

```{r}
# read interactions
int_fname <- system.file("extdata", "interactions.tsv", package = "CEMiTool")
int_df <- read.delim(int_fname)
head(int_df)
```

You can add the interaction data to your `CEMiTool` object using the `interactions_data` function and generate the plots with `plot_interactions`. Once again, the plots can be seen with the `show_plot` function:

```{r}
# plot interactions
library(ggplot2)
interactions_data(cem) <- int_df # add interactions
cem <- plot_interactions(cem) # generate plot
plots <- show_plot(cem, "interaction") # view the plot for the first module
plots[1]
```

# Putting it all together...

Finally, a `CEMiTool` object with all of the components mentioned above can also be constructed using just the `cemitool` function:

```{r, eval=FALSE}
# run cemitool
library(ggplot2)
cem <- cemitool(expr0, sample_annot, gmt_in, interactions=int_df, 
                filter=TRUE, plot=TRUE, verbose=TRUE)
# create report as html document
generate_report(cem, directory="./Report")

# write analysis results into files
write_files(cem, directory="./Tables")

# save all plots
save_plots(cem, "all", directory="./Plots")
```

# Troubleshooting

Sometimes, CEMiTool analyses can fail, usually due to problems in the input data. We provide a function, `diagnostic_report` which aims to try to assist in resolving these issues. 

```{r, eval=FALSE}
diagnostic_report(cem)
```

The function will return six different plots inside the report, all of which can be used to assess problems in the data. We will briefly discuss how each plot can be used to evaluate data problems.

## Sample clustering tree
This plot aims to show if there are closely related groups within samples. If a sample annotation file is given, the plot will show different colors for each sample group, and any numerical data given in other columns as a heatmap. This information can be used in order to see how homogeneous/heterogeneous the input data are. Highly heterogeneous sample groups may be the cause of batch effects, which should be removed. 

## Mean x variance scatterplot
In this plot, the mean and the variance of each gene in the expression file is plotted as the x and y coordinates of the scatterplot, and a line is plotted in order to show the relationship between the two. Particularly for RNAseq data, if a strong R-squared value is found, one should set the `apply_vst` argument in the `cemitool()` function to `TRUE` in order to remove this correspondance.

## Quantile-quantile plot and expression histogram
These plots are intended to highlight the distribution of expression values. A Q-Q plot is a mathematical approach to determine if data possibly arose from a theoretical distribution such as the normal distribution. This information can be used to guide the selection of an appropriate correlation coeffiecient for analyses, which can be changed via the `cor_method` argument of the `cemitool()` function. Currently accepted coefficients are "pearson" and "spearman".

## Beta x R2 plot and Mean connectivity plot
These plots can only be generated if the `diagnostic_report()` function is run after the `cemitool()` function. The Beta x R2 plot is used to visualize the selection of the soft-thresholding parameter Beta and its corresponding adherence to the scale-free topology model. Selected Beta values will be shown in red, unless NA. 
The Mean connectivity plot is intended to show the tradeoff between the network's underlying connectivity and a higher adherence to the scale-free topology model (via higher values of the soft-threshold Beta parameter).