Skip to content

Commit

Permalink
add feature score calculation for downstream analyses e.g., ranked GSEA
Browse files Browse the repository at this point in the history
  • Loading branch information
sreichl committed Jan 5, 2023
1 parent 5994eae commit 6cd10f2
Show file tree
Hide file tree
Showing 4 changed files with 32 additions and 5 deletions.
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,15 +38,15 @@ This project wouldn't be possible without the following software and their depen
# Methods
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml file) or post execution. Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g. [X].

__Differential Expression Analysis (DEA).__ DEA was performed on the quality-controlled filtered [raw/normalized] counts using the LIMMA (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we extracted feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), absolute log2 fold change (> [X]), and average gene expression (> [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.
__Differential Expression Analysis (DEA).__ DEA was performed on the quality-controlled filtered [raw/normalized] counts using the LIMMA (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we used topTable to extract feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Furthermore, we calculated feature scores, for each feature in all comparisons, using the formula [score_formula] for downstream ranked enrichment analyses. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), absolute log2 fold change (> [X]), and average gene expression (> [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.

__Visualization.__ The filtered result statistics, i.e., number of relevant features split by positive (up) and negative (down) effect sizes, were visualized with stacked bar plots using ggplot (ver) [ref].
To visually summarize results of all performed comparisons, the filtered effect size (log2 fold change) values of all features that were found to be relevant in at least one comparison were plotted in a hierarchically clustered heatmap using pheatmap (ver) [ref].
Volcano plots were generated for each comparison using EnhancedVolcano (ver) [ref] with adjusted p-value threshold of [pCutoff] and log2 fold change threshold of [FCcutoff] as visual cut-offs for the y- and x-axis, respectively.
Finally, quality control plots of the fitted mean-variance relationship and raw p-values of the features were generated.


**The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow [ref - cite this workflow here].**
**The analysis and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow (ver) [ref - cite this workflow here].**

# Features
The workflow performs the following steps that produce the outlined results:
Expand All @@ -63,6 +63,8 @@ The workflow performs the following steps that produce the outlined results:
- extract all statistics for variables of interest (=configured comparisons) using __topTable__ (eg coefficients/effect size, statistical significance,...).
- save a feature list per comparison group and direction of change (up/down) for downstream analyses (eg enrichment analysis) (TXT).
- (optional) annotated feature list with suffix "_annot" (TXT).
- (optional) save feature score tables (with two columns: "feature" and "score") per comparison group using score_formula for downstream analyses (eg ranked enrichment analysis) (CSV).
- (optional) annotated feature scores tables (with two columns: "feature_name" and "score") with suffix "_annot" (CSV).
- DEA result statistics: total number of statistically significant features and split by positive (up) and negative (down) change (CSV).
- DEA result filtering of features (eg genes) by
- statistical significance (<= adjusted p-value: adj_pval)
Expand Down
8 changes: 8 additions & 0 deletions config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,14 @@ reference_levels:
time: "0h"

##### AGGREGATION #####
# score_formula is used to calculate a score for each gene and comparison (group) that can be used downstream e.g., for ranked GSEA
# eval based score calculation -> eval(parse(text=" "))
# DEA result dataframe is called: dea_results
# available numerical columns: logFC; AveExpr; t; P.Value; adj.P.Val; B
# common/popular example: "-log10(dea_results$P.Value)*sign(dea_results$logFC)"
# if not used leave empty: ""
score_formula: "-log10(dea_results$P.Value)*sign(dea_results$logFC)"

# filters are applied to the DEA results, which contain all results, and used for downstream outputs (eg feature lists, LFC matrix).
filters:
adj_pval: 0.05
Expand Down
1 change: 1 addition & 0 deletions workflow/rules/dea.smk
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ rule aggregate:
os.path.join("logs","rules","aggregate_{analysis}.log"),
params:
partition=config.get("partition"),
score_formula = config["score_formula"],
adj_pval = config["filters"]["adj_pval"],
lfc = config["filters"]["lfc"],
ave_expr = config["filters"]["ave_expr"],
Expand Down
22 changes: 19 additions & 3 deletions workflow/scripts/aggregate.R
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ dea_stats_plot_path <- snakemake@output[["dea_stats_plot"]]
adj_pval <- snakemake@params[["adj_pval"]] # 0.05
lfc <- snakemake@params[["lfc"]] # 0
ave_expr <- snakemake@params[["ave_expr"]] # 0
score_formula <- snakemake@params[["score_formula"]] # "-log10(dea_results$P.Value)*sign(dea_results$logFC)"

# plot specifications
width <- 0.5
Expand All @@ -32,7 +33,7 @@ if (!dir.exists(results_path)){

### load DEA results
dea_results <- read.csv(file=file.path(dea_result_path))

groups <- unique(dea_results$group)

### save list of all expressed features for downstream analysis (eg as background in enrichment analyses)
all_features <- unique(dea_results$feature)
Expand All @@ -44,6 +45,22 @@ if("feature_name" %in% colnames(dea_results)){
write(all_features_annot, file.path(results_path, "ALL_features_annot.txt"))
}

# determine and save feature scores for each gene and group for downstream analysis e.g., ranked GSEA
if (score_formula!=""){
dea_results$score <- eval(parse(text=score_formula))

for (group in groups){
tmp_features <- dea_results[(dea_results$group==group),c("feature","score")]
write.csv(tmp_features, file.path(results_path,paste0(group,"_featureScores.csv")), row.names=FALSE)

if("feature_name" %in% colnames(dea_results)){
tmp_features <- dea_results[(dea_results$group==group),c("feature_name","score")]
tmp_features <- tmp_features[tmp_features$feature_name != "",]
write.csv(tmp_features, file.path(results_path,paste0(group,"_featureScores_annot.csv")), row.names=FALSE)
}
}
}

# annotate differential direction (up or down)
dea_results$direction <- as.character(lapply(dea_results$logFC, function(x) if(x>0){"up"}else{"down"}))

Expand All @@ -59,12 +76,11 @@ dea_filtered_stats_df$total <- rowSums(dea_filtered_stats_df)
write.csv(dea_filtered_stats_df, file=file.path(dea_stats_path), row.names=TRUE)

### aggregate & save LFC matrix from filtered DEA result features
groups <- unique(dea_results$group)
features <- unique(dea_filtered_results$feature)

lfc_df <- data.frame(matrix(nrow=length(features), ncol=length(groups), dimnames=list(features, groups)))

for (group in unique(dea_results$group)){
for (group in groups){
tmp_dea_results <- dea_results[dea_results$group==group, ]
rownames(tmp_dea_results) <- tmp_dea_results$feature
lfc_df[features, group] <- tmp_dea_results[features, 'logFC']
Expand Down

0 comments on commit 6cd10f2

Please sign in to comment.