The FacileData package was written to facilitate easier analysis of large, multi-assay high-throughput genomics datasets. To this end, the FacileData package provides two things:
- A FacileData Access API that defines a fluent interface over multi-assay genomics datasets that fits into the tidyverse. This enables analysts to more naturally query and retrieve data for general exploratory data analysis; and
- A reference implementation of a datastore that implements the FacileData Access API, called a FacileDataSet. The FacileDataSet provides efficient storage and retrieval of arbitrarily large high-throughput genomics datasets. For example, a single FacileDataSet can be used to store all of the RNA-seq, microarray, RPPA, etc. data from The Cancer Genome Atlas. This single FacileDataSet gives analysts easy access to arbitrary subsets of these data without having to load all of it into memory.
The FacileData suite of packages is only available from GitHub for now. You will want to install three FacileData* packages to appreciate its utility:
# install.packages("devtools")
devtools::install_github("Genentech/FacileData")

As a teaser, we provide code snippets that show how to plot HER2 copy number vs expression across the TCGA "BLCA" and "BRCA" indications using the FacileDataSet. We'll then compare that to how the same code might be written using more traditional Bioconductor containers.
library(dplyr)   # for %>% (assumption: added in case FacileData does not re-export the pipe)
library(ggplot2)
library(FacileData)
library(FacileTCGADataSet)
tcga <- FacileTCGADataSet()
features <- filter_features(tcga, name == "ERBB2")
fdat <- tcga %>%
  filter_samples(indication %in% c("BLCA", "BRCA")) %>%
  with_assay_data(features, assay_name = "rnaseq", normalized = TRUE) %>%
  with_assay_data(features, assay_name = "cnv_score") %>%
  with_sample_covariates(c("indication", "sex"))
ggplot(fdat, aes(cnv_score_ERBB2, ERBB2, color = sex)) +
  geom_point() +
  facet_wrap(~ indication)
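
The ggplot call above treats the result of the pipeline as a sample-level table with `cnv_score_ERBB2`, `ERBB2`, `sex`, and `indication` columns, so ordinary dplyr verbs should apply to it as well. The sketch below illustrates that idea on a small synthetic stand-in. `toy_fdat` and `toy_summary` are hypothetical names, and the values are made up for illustration; they are not TCGA data and do not come from a FacileDataSet.

```r
library(dplyr)

# Synthetic stand-in for `fdat`. The values are invented for illustration
# only; the column names mirror those used in the ggplot call.
toy_fdat <- tibble(
  indication      = c("BLCA", "BLCA", "BLCA", "BRCA", "BRCA", "BRCA"),
  sex             = c("female", "male", "male", "female", "female", "female"),
  cnv_score_ERBB2 = c(0.0, 0.5, 1.0, 0.0, 1.0, 2.0),
  ERBB2           = c(5.0, 6.0, 7.0, 5.0, 7.0, 9.0)
)

# Per-indication sample count and copy number vs expression correlation
toy_summary <- toy_fdat %>%
  group_by(indication) %>%
  summarize(
    n_samples    = n(),
    cnv_expr_cor = cor(cnv_score_ERBB2, ERBB2),
    .groups      = "drop"
  )

print(toy_summary)
```

Assuming the real `fdat` has the same shape, swapping `toy_fdat` for `fdat` would give the same summary for the actual BLCA and BRCA samples.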