Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
259 lines (199 sloc) 8.98 KB
---
title: "Data mining using lipidr"
author:
- name: Ahmed Mohamed
affiliation: Precision & Systems Biomedicine, QIMR Berghofer, Australia
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_depth: 2
vignette: >
%\VignetteIndexEntry{lipidr_data_mining}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
resource_files:
- figure/
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8)
library(knitr)
library(widgetframe)
library(rmarkdown)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.path = "figure/")
knitr::opts_chunk$set(widgetframe_isolate_widgets = TRUE)
knit_print.data.frame = function(x, ...) {
rmarkdown:::print.paged_df(rmarkdown::paged_table(x))
}
knit_print.DataFrame = function(x, ...) {
knit_print(as.data.frame(x))
}
knit_print.Iheatmap = function(x, ...) {
knit_print(frameWidget(iheatmapr::to_widget(x), height = 800))
}
registerS3method("knit_print", "data.frame", knit_print.data.frame)
registerS3method("knit_print", "DataFrame", knit_print.DataFrame)
registerS3method("knit_print", "Iheatmap", knit_print.Iheatmap)
registerS3method("knit_print", "IheatmapHorizontal", knit_print.Iheatmap)
```
# Introduction
Through integration with Metabolomics Workbench API, `lipidr` allows users,
to quickly explore public lipidomics experiments. `lipidr` provides an easy
way to re-analyze and visualize these datasets.
# Loading libraries
```{r libload}
library(lipidr)
use_interactive_graphics()
```
First, we load `lipidr` using `library` command. For this vignette we will enable interactive graphics plotting by calling `use_interactive_graphics()`.
# Explore Metabolomics Workbench lipidomics datasets
Users can search all Metabolomics Workbench studies using any
relevant keyword. A list of all studies matching the keyword will be
returned as a table (`data.frame`).
```{r list_studies}
list_mw_studies(keyword = "lipid")
```
# Download datasets
Datasets can be easily downloaded and parsed into `LipidomicsExperiment` object
using `lipidr` function `fetch_mw_study()` by supplying a `study_id`. The
example shown in this vignette is from `ST001111` [link](https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&DataMode=AllData&StudyID=ST001111&StudyType=MS&ResultType=1). The dataset contains positive and negative MS data for untargeted
lipidomics from different breast cancer tissues.
```{r fetch_study}
d = fetch_mw_study("ST001111")
d
```
Note the warning that some molecules were not parsed because their names did not follow the supported patterns. We can examine these molecules, remove them from the dataset or change their names, if desired.
```{r update_names}
# list non_parsed molecules
non_parsed_molecules(d)
# All of them are Ceramides, written with full chemical name
# We can replace the first part with "Cer" using RegEx
non_parsed <- non_parsed_molecules(d)
new_names <- sub("^.* \\(", "Cer (", non_parsed)
d <- update_molecule_names(d, old = non_parsed, new = new_names)
# We can check once again to make sure all molecules were parsed correctly
non_parsed_molecules(d)
```
We can have a look at the clinical data, which was conveniently
extracted from Metabolomics Workbench by `lipidr`.
```{r coldata}
colData(d)
```
Next, we tell `lipidr` that our dataset is normalized and logged.
```{r set_logged}
d <- set_logged(d, "Area", TRUE)
d <- set_normalized(d, "Area", TRUE)
```
## Quality control
We look at total ion concentration (TIC) and distribution (boxplot) for each sample.
```{r}
plot_samples(d, "tic")
plot_samples(d, "boxplot")
```
Although the TIC plot looks similar for all samples, we can spot two outlier samples with significantly large dispersion (samples `42` and `18`). We will keep the samples for now, but we may want to consider checking closely.
Because there is no QC samples or technical replicates in this dataset, we cannot
assess the %CV of molecules. Also, there is no need to `summarize_transitions`
in untargeted datasets.
# PCA
```{r}
mvaresults = mva(d, measure="Area", method="PCA")
plot_mva(mvaresults, color_by="SampleType", components = c(1,2))
```
We can see mild separation between benign and cancer samples, but not between cancer and metastasis. We can also spot the sample outlier samples that should consider removing. The low variance explained by `PC1` and `PC2`
(cumulative displayed in the plot as `R2X`) indicate highly variable lipid profiles in these clinical samples.
```{r remove_outliers}
keep_samples <- !colnames(d) %in% c("18", "42")
d <- d[, keep_samples]
```
# Univariate Analysis
## Two group comparison
For simple analysis, we can compare cancer vs benign and cancer vs metastasis. From PCA plot, we can
expect very little difference between cancer and metastasis.
```{r two_group}
two_group <- de_analysis(d, Cancer-Benign, Cancer-Metastasis)
plot_results_volcano(two_group)
```
By quickly looking at the volcano plots, we can confirm the minute difference between cancer and metastatic samples. A fairly large difference is observed between cancer and benign samples, with PCs and PGs up-regulated and CLs and TGs down-regulated in cancer tissues.
## Multi-group comparison
Instead of two-group comparison, we might be interested in lipids are differentially expressed in any group.
We can perform ANOVA-style multi-group comparison using `de_design` function, which allows users to provide
a custom *design matrix*. `de_design` is extremely helpful in complex experimental designs, where several
factors should be accounted for in the analysis.
In this example, we will use cancer stage as our grouping variable. In our dataset, we can see samples with Stages I-IV. Using ANOVA-style analysis, we will identify all lipid molecules likely to be different among cancer stages.
```{r multi-group_stage}
multi_group <- de_design(d, ~ Stage)
```
Here we used the formula *tilde* to define our predictor variables. `~ Stage` formula indicates that we are interested in features (lipid molecules) that are associated with `Stage`.
```{r multi-group_stage_sig}
significant_molecules(multi_group)
```
Surprisingly, Cancer Stage does not appear to affect lipid molecules profiled in this experiment.
## Factorial analysis
In complex experimental designs and clinical samples, we may need to correct for confounding variables. This is done simply by adding the variable to the formula in `de_design`. For example, below, we are interested in Cancer effect while correcting for *Race* effect.
```{r}
factorial_de <- de_design(d, ~ Race + SampleType, coef = "SampleTypeCancer")
significant_molecules(factorial_de)
plot_results_volcano(factorial_de)
```
In this case, we are seeing similar pattern as the two-group comparison, which indicates a small *Race* effect.
Users interested in creating more complex design matrices are referred to [Limma User Guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) and [edgeR tutorial](https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).
# Multivariate analysis
## Orthogonal multivariate analysis
Supervised multivariate analyses, such as OPLS and OPLS-DA can be performed to determine which lipids are associated with a group (y-variable) of interest. In this example we use "Diet" as grouping, and display the results in a scores plot.
```{r}
mvaresults = mva(d, method = "OPLS-DA", group_col = "SampleType", groups=c("Benign", "Cancer"))
plot_mva(mvaresults, color_by="SampleType")
```
We can also plot the loadings and display important lipids contributing to the separation between different (Diet) groups.
```{r}
plot_mva_loadings(mvaresults, color_by="Class", top.n=10)
```
Alternatively, we can extract top *N* lipids along with their annotations.
```{r}
top_lipids(mvaresults, top.n=10)
```
## Supervised multivariate analysis with continuous response variable
OPLS-DA can only be applied to in two-group comparison settings. In some cases, we
might be interested in a lipid molecules ....
In this example, we will format Cancer Stage as a numeric vector.
```{r plsda_stage}
stage <- d$Stage
stage[stage == "I"] <- 1
stage[stage == "II"] <- 2
stage[stage == "III"] <- 3
stage[stage == "IV"] <- 4
stage <- as.numeric(stage)
stage
```
We can see `stage` contains missing values. We should filter them out first.
```{r run_plsda}
d_filtered <- d[, !is.na(stage)]
stage <- stage[!is.na(stage)]
mvaresults = mva(d_filtered, method = "OPLS", group_col = stage )
plot_mva(mvaresults)
```
```{r}
use_interactive_graphics(FALSE)
plot_mva_loadings(mvaresults, color_by="Class", top.n=10)
```
# Enrichment analysis
```{r}
enrich_results = lsea(two_group, rank.by = "logFC")
significant_lipidsets(enrich_results)
```
Visualization of enrichment analysis results. The enriched lipid classes are highlighted.
```{r}
plot_class_enrichment(two_group, significant_lipidsets(enrich_results))
```
# Lipid chain analysis
```{r}
plot_trend(two_group)
```
# Clustering
```{r}
plot_heatmap(d, cluster_rows="none")
```
# Session information
```{r}
sessionInfo()
```
You can’t perform that action at this time.