This repository contains the scripts needed to reproduce the results of the paper:
Characterizing the landscape of gene expression variance in humans. Scott Wolf*, Diogo Melo*, Kristina M. Garske, Luisa F. Pallares, Amanda J. Lea, Julien F. Ayroles.
The main pipeline for generating gene-level metrics is built using Snakemake. To run the pipeline, run snakemake -c N (where N is the number of cores, or all to use all available cores) from the snakemake directory. Many of the supporting functions can be found in functions.R.
conda env create -f environment.yml
To run this, you will need to install the R packages referenced below. Pull https://bioconductor.org/packages/release/bioc/src/contrib/ExpressionAtlas_1.24.0.tar.gz and run R CMD INSTALL ExpressionAtlas_1.24.0.tar.gz. The remaining packages can then be installed with:
install.packages("pak")
pak::pkg_install(c("log4r","here","ExpressionAtlas","recount3","wesanderson", "cowplot", "rrcov", "AnnotationDbi", "DESeq2", "sergihervas/iMKT", "clusterProfiler", "viridis", "org.Hs.eg.db","edgeR", "janitor", "vsn","doMC", "Rfast", "missMDA", "corrplot", "vegan", "Hmisc", "ggfortify"))
The following notes describe each piece of the Snakemake workflow in detail. The Snakemake file can be viewed at snakemake/Snakefile. The corresponding config file detailing the data sources is located at snakemake/config.yaml, and the config for running the pipeline on a Slurm cluster is located at snakemake/config_slurm.yaml.
After reading each set of study ids and appending the corresponding source, we download the data from that source. This is done by passing our study ids, along with the corresponding source (Expression Atlas or recount3) and the metadata for each. For sources found in Expression Atlas and recount3, we use the corresponding R packages (ExpressionAtlas and recount3) to pull the raw data and its metadata. For each study, we output the data to /snakemake/Rdatas/raw/{id}.rds. The source for this step can be found in download_datasets.R.
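As a concrete illustration, the sketch below shows how a single study might be pulled from each source using the ExpressionAtlas and recount3 R packages. The accessions and object names are placeholders for illustration only; the actual download logic and the study list live in download_datasets.R and snakemake/config.yaml.

```r
## Illustrative sketch of pulling one study per source; the accessions
## below are placeholders, not necessarily studies used in the paper.
library(ExpressionAtlas)
library(recount3)
library(SummarizedExperiment)

## Expression Atlas: returns a SimpleList of SummarizedExperiments,
## typically with an "rnaseq" element holding counts and sample metadata.
ea_id  <- "E-MTAB-2836"                 # placeholder accession
ea_exp <- getAtlasExperiment(ea_id)
ea_rse <- ea_exp$rnaseq

## recount3: look up the project, build a RangedSummarizedExperiment,
## and convert base-pair coverage into read counts.
rc_id    <- "SRP009615"                 # placeholder accession
projects <- available_projects()
rse <- create_rse(subset(projects, project == rc_id & project_type == "data_sources"))
assay(rse, "counts") <- transform_counts(rse)

## The pipeline writes one .rds per study id, e.g. snakemake/Rdatas/raw/{id}.rds
saveRDS(rse, file.path("snakemake", "Rdatas", "raw", paste0(rc_id, ".rds")))
```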
After gathering the raw data, we need to preprocess it. We start by passing in the id and metadata for each study. After loading the datasets, we preprocess the data by (a sketch of the gene-level expression filters is shown below):
- Filtering out manually curated columns of problematic metadata. The hand-curated columns can be found in snakemake/metadata/EA_metadata.csv and snakemake/metadata/recount3_metadata.csv, respectively.
- Removing portions of studies where cell lines are used (e.g. Cultured Fibroblasts in the Skin tissue set from GTEx) and where multiple tissue types are pooled (e.g. Esophagus from GTEx).
- Filtering to keep only control samples from each study.
- Summing technical replicates.
- Dropping genes with expression below 1 cpm.
- Dropping genes with mean expression below 5 cpm.
- Dropping mitochondrial genes.
- Removing extreme outliers from the Blood and Stomach samples.
- Recalculating normalization factors on the remaining data.
- Removing redundant metadata.
- Removing factors with a large number of levels.
Following this, for each dataset, we output the filtered expression data and its corresponding filtered metadata to /snakemake/Rdatas/preProcess/{id}.rds. The source for this step can be found in preprocessing.R.
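For reference, the gene-level filters above can be expressed in a few lines with edgeR. The sketch below is an illustration only, assuming a counts matrix with technical replicates already summed and gene symbols as row names; the exact thresholds and implementation are in preprocessing.R.

```r
## Sketch of the expression filters listed above. Assumes `counts` is a
## gene-by-sample matrix with technical replicates already summed and
## `sample_info` its matching metadata (both names are illustrative).
library(edgeR)

dge     <- DGEList(counts = counts, samples = sample_info)
cpm_mat <- cpm(dge)

keep <- rowSums(cpm_mat >= 1) == ncol(cpm_mat) &  # >= 1 cpm in every sample (exact rule in preprocessing.R)
        rowMeans(cpm_mat) >= 5 &                  # mean expression >= 5 cpm
        !grepl("^MT-", rownames(cpm_mat))         # drop mitochondrial genes (assumes gene symbols)

dge <- dge[keep, , keep.lib.sizes = FALSE]

## Recalculate normalization factors on the remaining genes
dge <- calcNormFactors(dge)
```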
Following preprocessing, we calculate residual expression after removing confounding factors. We do this for each study by (a minimal sketch of these steps follows the list):
- Making a design matrix using the preprocessed data, ignoring manually identified problematic columns.
- Calculating a variance stabilizing transform as implemented in DESeq2.
- Filtering outlier individuals using a robust Principal Component Analysis approach, as described by Chen et al. 2020.
- Recalculating the design matrix after filtering for outliers and stabilizing variance.
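These steps map onto standard DESeq2 and rrcov calls. The sketch below is a minimal illustration with made-up covariate names (sex, age, batch) and PcaHubert standing in for the robust PCA; it is not the exact code used in the workflow.

```r
## Minimal sketch of the residual step. Assumes `counts` and `metadata`
## come from the preprocessing output, with no missing covariate values;
## the covariate names are illustrative.
library(DESeq2)
library(rrcov)

## 1. Design matrix from the curated metadata
design <- model.matrix(~ sex + age + batch, data = metadata)

## 2. Variance stabilizing transform as implemented in DESeq2
dds      <- DESeqDataSetFromMatrix(countData = counts, colData = metadata, design = ~ 1)
vst_expr <- assay(varianceStabilizingTransformation(dds))    # genes x samples

## 3. Robust PCA to flag outlier individuals; observations flagged as
##    regular are kept.
rpca         <- PcaHubert(t(vst_expr))
keep_samples <- which(rpca@flag != 0)

## 4. Recompute the design on the retained samples and take residuals of
##    each gene's variance-stabilized expression on the covariates.
design_kept   <- design[keep_samples, , drop = FALSE]
residual_expr <- t(apply(vst_expr[, keep_samples], 1,
                         function(g) lm.fit(design_kept, g)$residuals))
```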
This step then:

- Generates a CSV file containing the residuals for each study.
- Creates a fully connected gene-by-gene graph for each study, in which each edge is weighted by the Spearman correlation between the expression of the two genes (see the sketch below).
- Calculates a standardized, cross-study measure of gene expression variation using the post-correction expression residuals.
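For the gene-by-gene graph, the edge weights are simply pairwise Spearman correlations of the residual expression. A self-contained toy sketch, with random data standing in for the residuals:

```r
## Toy sketch of the graph's edge weights: pairwise Spearman correlations
## between genes, computed across samples from the residual expression.
set.seed(1)
residual_expr <- matrix(rnorm(5 * 20), nrow = 5,
                        dimnames = list(paste0("gene", 1:5), paste0("sample", 1:20)))
spearman_adj <- cor(t(residual_expr), method = "spearman")  # 5 x 5 symmetric weight matrix
```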
Postprocessing and analysis code can be found in the analysis folder.