Skip to content
Md Badiuzzaman Pranto edited this page May 17, 2024 · 22 revisions

Welcome to the grn-benchmark wiki! This wiki contains all information on the tools!

Tools:

Z-score:

Methodology:

  1. Computes Pearson's or Spearman's correlation coefficients $\rho_{X, Y}$ for each pair of genes X and Y: $\rho_{X, Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$
  2. Apply Fisher z-transformation afterwards: $\rho_g\prime = artanh(\rho_g)$ where $g \in \lbrace X, Y \rbrace $
  3. Model the resulting z-scores as two normal distributions with means $\rho_g\prime$ and standard deviation $\frac{1}{\sqrt{N_g - 3}}$ where $N_g$ is the sample size of the respective genes
  4. Apply the z-test: $Z = \frac{\rho_X - \rho_Y}{\sqrt{\frac{1}{N_X - 3} + \frac{1}{N_Y - 3}}}$
  5. Keep significant edges.

Rationale:

Usage of an established statistical test. Interactions are easily explainable afterwards.

Parameters:

None.


MODA-Wrapper:

Methodology:

  1. Takes input $X \in \mathbb{R}^{p\times n}$, where $p$ is the genes and $n$ is the experimental samples, and transforms it to $X \in \mathbb{R}^{n\times p}$ which then get passed to MODA.

  2. Fixes MODA's Hierarchical clustering which produces dense modules and their corresponding gene IDs but however does not generate the corresponding gene names.

  3. Determines the optimal cutting height for a any given module $A$, by using average density of the module, defined as: $$A_{Density} =\dfrac{ \sum_{i \in A} \sum_{j \in A, j\neq i} a_{ij}}{n_A(n_a-1)}$$ where $a_{ij}$ is the similarity between gene $i$ and gene $j$, and $n_A$ is the number of genes in $A$.

  4. Performs:

  1. Generates differentially expressed and conserved gene list.

Rationale:

A consistent and user-friendly way to get MODA up and running quickly without dependency issues and additionally adds new capabilities to MODA.


MATLAB Tool for Modulated Gene/Gene Set Interaction (MAGIC) Analysis:

Methodology:

Note: C (condition) is used instead of M (modulator) to denote conditions, to reflect the generalized utility of the tool.

  1. To quantify the interaction between genes:
    Pearson correlation coefficients $I_{C}$ between gene pairs $i$ and $j$, for each condition $C\epsilon{[0,1]}$ separately
    $I_{C}(i,j) = corr(E_{C}(i,:), E_{C}(j,:))$ where $E_{C}(i,:)$ and $E_{C}(j,:)$ are the expression levels of genes $i$ and $j$ in vector format
  2. To identify gene pairs that exhibit a significant difference in their interaction across the two conditions:
    rely on conjugate Fisher-inverse Fisher transformation and two criteria(Modulation Test and Modulation Score)
  3. Criterion 1: Using Modulation Test
  • Assumption: correlation is independent of the condition
  • To handle bias from sample sizes $G_{0}$ and $G_{1}$:
    Project the correlation coefficients to a normally distributed, sample-size-free domain through Fisher transformation:
    $I_{C}^{F} = F(I_{C},\eta_{C})) = \frac{\sqrt{\eta_{C}-3}}{2}ln\frac{1+I_{C}}{1-I_{C}}$
    where $\eta_{C}$ is the sample size under condition C.
  • To estimate the statistical significance of difference in correlation:
  • Null hypothesis: $H_{0}: \Delta I^{F} = |I_{C=1}^{F}| - |I_{C=0}^{F}| = 0$
  • Hypothesis: $H_{0}: \Delta I^{F} = |I_{C=1}^{F}| - |I_{C=0}^{F}|\ne 0$
  • Using normal distribution of Fisher transformation, the probability density function (PDF) and the cumulative distribution function (CDF) of $\Delta I^{F}$ are used to determine the modulation test p-value directly:
  • PDF: $\mathcal{f}(\Delta I^{F}) = \frac{1}{\sqrt{\Pi}}\mathcal{e}^{-\frac{1}{4}\Delta I^{F2}}\cdot [1 - erf\ \left( \frac{\left| \Delta I^{F} \right|}{2}\right)]$
  • CDF: $\mathcal{F}(\Delta I^{F}) = \frac{1}{2} + erf\left( \frac{\Delta I^{F}}{2}\right) - \frac{1}{2}sgn(\Delta I^{F})\cdot \left[ erf\left( \frac{\Delta I^{F}}{2} \right) \right]$
    • where $erf()$ is the Gauss error function and $sgn()$ is the sign function
  • The statistical significance threshold is user-defined and applied directly on the computed raw p-values if bonf is set to 0.
  • If bonf is set to 1, the threshold is adjusted for multiple hypothesis testing using Bonferroni correction to account for the multiple gene interactions.
  • If p-value is $\le$ the significance threshold, the null hypothesis that the condition has no effect on that gene pair is rejected and the interaction is considered significant.
  1. Criterion 2: Using Modulation Score
  • To return to assigned sample size $\eta_{C}$:
    inverse Fisher transformation
    $I_{C}^{adj}=\mathcal{F}^{-1}(I_{C}^{F},\eta_{C})=\frac{1}{\sqrt{\eta_{C}-3}}\frac{e^{2I^{F}_{C}}-1}{e^{2I^{F}_{C}}+1}$ where $\eta_{C}$ is the user-defined sample size.
  • Following this transformation, the correlation coefficients sampled from the two populations can used to determine the modulation score: $\Delta I^{adj}=\left|I^{adj}_{C=1}\right|-\left|I^{adj}_{C=0}\right|$
  • To determine the biological significance of the interaction, a user-defined threshold $\Delta I_{th}$ is used:
  • Interactions specific to Condition 2:
    $\Delta I^{adj}\gt \Delta I_{th}$
  • Interaction specific to Condition 1:
    $\Delta I^{adj}\lt - \Delta I_{th}$

Gene Set Application: To apply gene set interaction analysis, instead of gene expression $E_{C}$, gene set enrichment scores $ES_{C}$ are used. These scores are obtained as follows:

  1. Z-transformation of gene expression data: $z_{kn} = \frac{(x_{kn}-\mu_{k}))}{\sigma_{k}}$ where
  • $x_{kn}$ : gene expression data of k-th gene in n-th sample
  • $\mu_{k}$ : mean of gene expression in k-th gene
  • $\sigma_{k}$ : standard deviation of gene expression in k-th gene In matrix format: $Z_{C} = \left\{ z_{kn} \right\}_{C}$
  1. Gene Set Content Matrix $G$:
  • create matrix with dimensions S by K, where S is the number of gene sets and K is the number of genes, respectively.
  • if gene set $s\epsilon S$ contains gene $k\epsilon K$: $G(s,k)=\frac{1}{n_{s}}$ where $n_{s}$ is the number of genes in s
  • if gene set $s\epsilon S$ does not contain gene $k\epsilon K$: $G(s,k)=0$
  1. Gene Set Enrichment Score representing degree of activation for gene set $s\epsilon S$ in a sample $n\epsilon N$ under a given condition $C\epsilon [0,1]$:
  • calculated by taking the inner average of all z-scores of the genes in a gene set for one sample: $a_{sn}=\frac{1}{n_{s}}\sum_{k\epsilon s}z_{kn}$ for n-th sample in s-th gene set In matrix format: $ES_{C}=G\cdot Z_{C}$
  1. Filtering non-informative gene sets using $L_{0.05}$ criterion:
  • Assumption: standard normal distribution of gene set enrichment scores, by law of large numbers
  • $L_{0.05}$ criterion: $\mp 1.96$ times the standard deviation of enrichment scores used as an informativeness measure for the gene sets
  • filter out gene sets whose enrichment scores fall within $L_{0.05}$ boundary for more than %80 of samples, as they are considered non-informative
  1. The rest of the steps are the same as above, with $ES_{C}$ replacing $E_{C}$

Rationale:

Identifying gene pairs with differential interactions under two conditions, based on the assumption that a condition which is relevant to a biological system significantly impact a network of genes when it is present, but will have a minimal effect when it is absent.


SCANet:

Methodology:

  1. Creates representative cells to reduce the computational complexity by grouping neighboring cells in the preprocessed scRNA-seq data, which is an expression matrix with rows representing individual cells and columns representing genes.
  2. Utilizes modified Weighted Gene Co-expression Network Analysis (WGCNA) to infer modules of strongly co-expressed genes.
  3. Allows users to explore and visualize the inferred gene modules, helping them identify specific modules of interest for further investigation.
  4. Infers gene regulatory networks (GRN) of the modules of interest using GRNBoost2 algorithm. RcisTarget is then used to ensure accuracy through cis-regulatory TF-binding motif enrichment analysis.
  5. Suggests existing drugs that could potentially influence these gene networks by comparing them with drug databases.

Rationale:

SCANet offers an integrated approach that combines individual techniques to improve the understanding of gene networks and potential drug targets.

Parameters:

  • Number of representative cells per cluster.
  • Correlation cutoff, which Specifies the minimum threshold for co-expression strength that defines whether an edge should be included in the network. Edges with co-expression values below this threshold will be excluded from the network.

Discordant:

Methodology:

  1. Computes the correlation coefficients for each feature pair in two conditions. Available correlation metrics are Spearman's, Pearson's, BWMC, and SparCC.
  2. Utilize the Expectation maximization(EM) algorithm to fit a mixture model (3 or 5 compnents) to the correlation coefficients for each condition.
  3. Estimate posterior probability (pp) of each feature pair belonging to each component for each condition.
  4. Compute differential correlation posterior probability by summing pp where a feature pair belong to two different components between conditions.
  5. Features pairs with differential correlation posterior probability above a threshold are considered as differentially co-expressed.
  6. Build a network where nodes are differentially co-expressed gene pairs and edges are changes in correlation across conditions.

Note : EM with subsampling option available to reduce computational cost for large datasets.

Rationale:

Discordant, a powerful tool to identify differential co-expression, provides novel flexibility allowing the user to choose the correlation metric, number of components, and subsampling option.

Parameters:

  • cor.method: Method to compute the correlation coefficient .
  • components: Number of components for the mixture model.
  • subsampling: whether to subsample the correlation coefficients during fitting.

DGCA-CLI-TOOL

Rationale:

This tool is a command-line utility designed as a wrapper around the DGCA package, facilitating Differential Gene Correlation Analysis between two conditions. It aids in identifying significant changes in gene correlation across different biological conditions, supporting advanced biological insights into gene regulatory mechanisms

Methodology

This section describes the methodology implemented by the CLI tool to perform Differential Gene Correlation Analysis using the DGCA package. The process is structured into several steps, starting from the initial data input to the final output generation, as detailed in below.

Workflow of the DGCA Command-Line Tool

Requirements:

  • Two gene expression files of distinct conditions

Output:

  • network.tsv file containing the differential gene correlation analysis results. A detailed mathematical explanation of the network.tsv file's weights can be found in the README.md file of my CLI tool’s repository.

Steps:

  1. Input File:

    • Takes gene expression data of two conditions as input.
  2. Prepare Inputs for DGCA Package:

    • Construct a matrix of gene expression values.
    • Create a design matrix specifying the conditions associated with samples.
    • Specify the conditions for comparison.
  3. Execute DGCA:

    • Run the DGCA analysis using the prepared input matrices from step 2 to generate results.
  4. Format Conversion of Results:

    • Convert the raw results from DGCA into the required format for the project.
  5. Generate Output File:

    • Output the final results into a network.tsv file which contains the network of differentially correlated genes as computed by the DGCA.
  6. Return Output:

    • The process returns network.tsv containing the results of the differential gene correlation analysis.

Parameters:

  • --input_file_1: Path to the TSV file containing gene expression data for the first condition.
  • --input_file_2: Path to the TSV file containing gene expression data for the second condition.
  • --output_path: Output path where the results will be saved. The results file will be named 'network.tsv'.

CODC-CLI-TOOL

Rationale

The CODC CLI tool is designed for analyzing gene expression data to calculate differential co-expression using a copula-based approach. It is implemented in Python based on the R implementation of Ray et al. for enhanced performance, with support for parallel processing. The tool allows users to compute differential co-expression networks and provides additional commands for downstream analysis and performance measurement. Installation can be done via Docker or locally using PDM, a Python package manager. The tool expects input files in TSV format and outputs the co-expression network as a TSV file as well.

Methodology:

The methodology adopted for identifying differential gene co-expression using a copula-based approach is systematically structured into computational steps that transform raw gene expression data into a quantifiable measure of differential co-expression. Here, I detail each computational step and the mathematical underpinnings that guide the analyses.

The source code of R's copula package, which do this calculation on the R implementation can be found here

Conceptual Framework

Pseudo-Observations

Converting raw data samples into pseudo-observations is a crucial step in copula-based analysis. This process involves ranking the data within each variable and normalizing these ranks to the unit interval $[0, 1)$, which prepares them for further analysis using empirical copulas. The implementations use the equation 1 for this purpose. There is no direct method in R which implements this, but they did it indirectly. The python implementation pseudo_observations() function can be found under copula/empirical_copula.py file.

$$ p_{ij} = \frac{\text{rank}(x_{ij})}{n + 1} [equation 1] $$

Here, $p_{ij}$ is the pseudo-observation corresponding to $x_{ij}$. The function $\text{rank}(x_{ij})$ assigns a rank to each data point based on its value relative to other samples in the same variable, with adjustments based on the chosen ties method (e.g., 'average', 'min', 'max', 'dense', 'ordinal').

Empirical Copula

A copula is a statistical tool used to describe the dependency between random variables. The empirical copula specifically captures the joint distribution of variables by transforming their margins to uniform distributions on the interval $[0, 1]$.

$$ C(u, v) = \text{Prob}(U \leq u, V \leq v) [equation 2] $$

where $U$ and $V$ are the uniform transformations (pseudo-observations) of the original variables $X$ and $Y$, representing gene expressions.

Comparing Empirical Copulas with the KS Test

The Kolmogorov-Smirnov test is employed to compare the empirical copulas of a gene pair across two conditions using the equation, quantifying the maximum distance between their cumulative distribution functions.

$$ D = \sup_x |F_{1,n}(x) - F_{2,n}(x)| [equation 3] $$

where $F_{1,n}$ and $F_{2,n}$ are the empirical distribution functions of the gene pair under the first and second conditions, respectively.

Copula-based Differential Co-Expression

The equation 4 is proposed by the authors to calculate the final distance, which utilizes the previous equations in order to calculate the differential co-expression of a particular gene pair.

$$ \begin{split} DC_{Copula}(g_i, g_j) &= D(C(g_i^{condition 1}, g_j^{condition 1}), \\ &\quad C(g_i^{condition 2}, g_j^{condition 2})) [equation 4] \end{split} $$

where $g_i$ and $g_j$ are genes being compared, and $DC_{Copula}(g_i, g_j)$ is the differential coexpression score for the gene pair $(g_i, g_j)$.

Computational Steps

Requirements:

  • Input file 1
  • Input file 2
  • Output path
  • Ensure the correctness of input format and existence of the output directory

Procedure:

  1. Data Preparation:

    • Verify that both matrices contain identical genes in the same order.
  2. Pairing and Transforming Gene Data:

    • For each gene pair (i) and (j), construct two matrices for each condition by horizontally stacking their expression profiles.
  3. Generating Pseudo-Observations:

    • Convert expression levels to a scale of [0, 1] by ranking the expression values and scaling these ranks using the specified equation 1.
  4. Empirical Copula Construction:

    • Establish the joint distribution function using equation 2, for each gene pair in each condition based on the uniform pseudo-observations.
  5. Differential Coexpression Calculation:

    • Apply the KS test to the empirical copulas from both conditions to determine the maximum divergence in their distribution behaviors. Use equation 3 on this purpose.
  6. Compiling and Reporting Results:

    • Output a network.tsv file detailing the differential coexpression network, specifying target, regulator, condition, and weight for each gene pair.

Return:

  • network.tsv

Parameters:

--input_file_1 : Path to the TSV file containing gene expression data for the first condition.

--input_file_2: Path to the TSV file containing gene expression data for the second condition.

--output_path : The directory where the output TSV file will be saved. This file will contain the computed differential co-expression network based on copula approach.

--ties_method : Method to handle ties in data ranking within the pseudo-observations calculation.

--smoothing: Specifies the smoothing technique applied to the empirical copula calculation.

--ks_stat_method: Determines the method used for computing the Kolmogorov-Smirnov statistic, which quantifies the differential co-expression.

--batch_size: Determines how many pair of genes will be executed in each batch in parallel execution.


Add your tools here!