Characterizing the landscape of gene expression variance in humans

Repository contains scripts to reproduce results of the paper as below:

Characterizing the landscape of gene expression variance in humans. Scott Wolf*, Diogo Melo*, Kristina M. Garske, Luisa F. Pallares, Amanda J. Lea, Julien F. Ayroles.

Workflow

The main pipeline for generating gene-level metrics is built using Snakemake. To run the pipeline, run snakemake -c N where N is the number of cores(or all) from the snakemake directory. Many of the supporting functions can be found in functions.R.

conda env create -f environment.yml

To run this, you will need to install the packages referenced below and Pull https://bioconductor.org/packages/release/bioc/src/contrib/ExpressionAtlas_1.24.0.tar.gz and run R CMD INSTALL ExpressionAtlas_1.24.0.tar.gz.

install.packages("pak")
pak::pkg_install(c("log4r","here","ExpressionAtlas","recount3","wesanderson", "cowplot", "rrcov", "AnnotationDbi", "DESeq2", "sergihervas/iMKT", "clusterProfiler", "viridis", "org.Hs.eg.db","edgeR", "janitor", "vsn","doMC", "Rfast", "missMDA", "corrplot", "vegan", "Hmisc", "ggfortify"))

Snakemake pipeline

The following notes describe each piece of the Snakemake workflow in detail. The Snakemake file can be viewed at snakemake/Snakefile. The corresponding config file detailing the datasources is located at snakemake/config.yaml, and the config for running this on a Slurm cluster is located at snakemake/config_slurm.yaml.

Data download

After reading each set of ids and appending the corresponding source, we download the data from the source. This is done by passing our study ids along with the corresponding source (Expression Atlas or recount3) and metadata for each. For sources found in Expression Atlas and recount3, we use the corresponding R packages (ExpressionAtlas and recount3) to pull the rawdata and corresponding metadata. For each of these, we output the data in /snakemake/Rdatas/raw/{id}.rds. The source for doing this can be found in: download_datasets.R.

Preprocessing

After gathering the raw data, we need to preprocess the data. We initially passing the id and metadata. After loading the datasets, we preprocess the data by:

Filtering out manually curated columns of problematic metadata. The columns that were hand curated can be found in snakemake/metadata/EA_metadata.csv and snakemake/metadata/recount3_metadata.csv accordingly.
Removing portions of studies where cell lines are used (i.e. Cultured Fibroblasts in the Skin tissue set from GTEx) and where multiple tissue types are pooled (i.e. Esophagus from GTEx).
Filtering to get only control samples from each study.
Summing technical replicates.
Drop gene with expression less than 1 cpm.
Drop genes with mean expression below 5 cpm
Drop mitochondrial genes
Remove extreme outliers from BLOOD and Stomach samples
Recalculate normalization factors on the remaining data
Remove redundant metadata
Remove factors with large number of levels

Following this, for each dataset, we output the filtered expression data and its corresponding filtered metadata at /snakemake/Rdatas/preProcess/{id}.rds. The source for doing this can be found in: preprocessing.R.

Calculating residuals

Following preprocessing, we want to calculate residual expression after removing the confounding factors. We do this for each study by:

Making a design matrix using preprocessed data and ignoring manually identified problematic columns.
Calculate a variance stabilizing transform as implemented in DESeq2.
Filter outlier individuals using a robust Principal Component Analysis approach as described by Chen et al. 2020.
Recalculate design matrix after filtering for outliers and stabilizing variance.

Postprocessing

CSV residuals

Generates a CSV file containing the residuals for each study.

Graph

Creates a a fully connected gene-by-gene graph in which each edge is weighted by the Spearman correlation between gene expression for each study.

Network stats

Gene metrics

Variance Scores

Calculates a standard, cross-study gene expression variation using the post correction expression residuals.

Postprocessing

Postprocessing and analysis can be found in analysis

Name		Name	Last commit message	Last commit date
Latest commit History 443 Commits
analysis		analysis
data		data
papers		papers
snakemake		snakemake
.gitignore		.gitignore
CITATION.cff		CITATION.cff
README.md		README.md
functions.R		functions.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Characterizing the landscape of gene expression variance in humans

Workflow

Snakemake pipeline

Data download

Preprocessing

Calculating residuals

Postprocessing

CSV residuals

Graph

Network stats

Gene metrics

Variance Scores

Postprocessing

About

Uh oh!

Releases 3

Packages

Languages

ayroles-lab/expressionVariance-code

Folders and files

Latest commit

History

Repository files navigation

Characterizing the landscape of gene expression variance in humans

Workflow

Snakemake pipeline

Data download

Preprocessing

Calculating residuals

Postprocessing

CSV residuals

Graph

Network stats

Gene metrics

Variance Scores

Postprocessing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages