### Identifying gene expression modules using an ICA matrix decomposition

The typical analysis performed on gene expression datasets is differential expression (DE) analysis, in which a generalized linear model (implemented in packages such as DESeq2, limma, and edgeR) is used to test the association between the expression of a single gene and a phenotype of interest, one gene at a time.

Because each gene is tested in isolation, genes (in part representative of their protein products) that function in units, such as co-expressed or co-regulated genes, may not be uncovered. These gene units, or modules, have been hypothesized to characterize complex disease processes accurately, and importantly, they do so in the context of functionally related genes (Chaussabel and Baldwin 2014; Saelens et al. 2018). Several statistical techniques can identify modules from a gene expression matrix. Most rely on unsupervised learning, such as correlation networks (e.g., Weighted Gene Co-expression Network Analysis, WGCNA), clustering (e.g., hierarchical, K-means), and matrix decomposition (e.g., PCA, Independent Component Analysis/ICA). Prior benchmarking studies concluded that ICA-derived modules outperformed those obtained by other methods when inferred modules were compared to known regulatory modules in E. coli, S. cerevisiae, and humans (Rotival et al. 2011; Saelens et al. 2018).

In the context of gene expression analysis, ICA aims to uncover underlying biological processes (e.g., mechanisms mediating transcriptional regulation, signaling cascades, immune responses, etc.) represented by gene modules that yield the observed gene expression events. In other words, the observed gene expression matrix can be considered a net sum of unobserved or latent biological processes, which are inferred by the ICA matrix decomposition (briefly explained below). There are two notable advantages to this technique. Firstly, ICA allows a gene to participate in more than one module, unlike clustering and correlation network approaches, which assign each gene to a single cluster or module based on a set distance cut-off, a constraint that is biologically implausible. Secondly, the ICA-derived modules are statistically independent by construction (or as independent as possible), which allows the components to map to distinct biological processes influencing gene expression. These components are reminiscent of the principal components obtained using the related method PCA, although principal components are only required to be uncorrelated, not statistically independent, and may therefore fail to map to independent biological processes.
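The "net sum" view can be written compactly. The notation here is assumed for illustration only: $X$ is the $p \times n$ expression matrix (genes by samples), $S$ is a $p \times k$ matrix of gene-level component weights, and $A$ is a $k \times n$ matrix of component activity per sample (the symbol $A$ is introduced here and is not part of the original description).

$$
X \approx S A, \qquad x_{ij} \approx \sum_{c=1}^{k} s_{ic}\, a_{cj}
$$

Under this reading, the expression of gene $i$ in sample $j$ is approximated by the sum of $k$ latent contributions, and ICA seeks the $S$ whose columns are as statistically independent as possible.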

I applied a pipeline previously used by Teschendorff et al. (2007) and Rotival et al. (2011), which employs the FastICA algorithm to identify components (Hyvärinen and Oja 2000). Briefly, the input to the FastICA algorithm is a p × n gene expression matrix (where p represents genes and n represents patients/observations). One of the matrices produced by the decomposition is a p × k “Signal” matrix. Here, the k columns represent a set of statistically independent “components” (or random variables) describing the activation (or contribution) of each of the p genes in each component. This approach was used to identify 37 components representing independent biological processes. Each ICA component was reduced to a set of module genes that strongly influenced the component distribution (i.e., the genes at the extremes of the distribution), as assessed by false discovery rate (FDR) estimates (Strimmer 2008; FDR < 10⁻³).

An “eigenmodule”, defined as the first principal component of the module gene expression matrix, was derived to summarize module gene expression into a single variable. This approach was inspired by the WGCNA method, which, upon identifying modules of highly correlated genes, summarizes module gene expression in an identical fashion (Langfelder and Horvath 2008). A multiple linear regression was performed to estimate the association of eigenmodules with eventual mortality, wherein the response variable was the eigenmodule, and covariates included binarized mortality status, age, sex, and PCA-transformed cell proportions. Cell proportions were included in the linear model in case modules captured sets of genes whose expression was correlated because they are expressed by a specific cell type. Functional characterization of DE and module genes was performed using an overrepresentation analysis of Reactome and MSigDB Hallmarks (adjusted p-value ≤ 0.05) (Fabregat et al. 2018; Liberzon et al. 2015).
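The steps above can be sketched in Python as follows. This is a minimal illustration on synthetic data, not the pipeline used in the study: it assumes scikit-learn's `FastICA` as the FastICA implementation, centers each gene before decomposition, and substitutes a simple |z| cutoff for the FDR-based tail selection of Strimmer (2008). All function names, the `z_cut` parameter, and the simulated covariates are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FastICA


def ica_signal_matrix(X, k, seed=0):
    """Decompose a p x n expression matrix into a p x k signal matrix.

    Genes (rows) are passed to scikit-learn as observations, so the
    returned sources have one row per gene and one column per component.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # assumption: center each gene
    ica = FastICA(n_components=k, random_state=seed, max_iter=1000)
    S = ica.fit_transform(Xc)                # p x k gene weights
    A = ica.mixing_.T                        # k x n component activity per sample
    return S, A


def module_genes(S, z_cut=3.0):
    """Per component, indices of genes in the tails of the distribution.

    Assumption: a |z| cutoff stands in for the FDR < 1e-3 criterion.
    """
    Z = (S - S.mean(axis=0)) / S.std(axis=0)
    return [np.flatnonzero(np.abs(Z[:, j]) > z_cut) for j in range(S.shape[1])]


def eigenmodule(X, gene_idx):
    """First principal component (one value per sample) of the module genes."""
    M = X[gene_idx, :]
    M = M - M.mean(axis=1, keepdims=True)
    sd = M.std(axis=1, keepdims=True)
    M = M / np.where(sd > 0, sd, 1.0)        # scale genes, guard constant rows
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]                             # unit-length vector of length n


def fit_eigenmodule_model(eig, mortality, covariates):
    """Ordinary least squares with the eigenmodule as the response.

    Coefficient order: intercept, mortality, then the supplied covariates.
    """
    design = np.column_stack([np.ones_like(eig), mortality, covariates])
    beta, *_ = np.linalg.lstsq(design, eig, rcond=None)
    return beta


# Synthetic demo: 500 genes, 60 samples, 3 heavy-tailed latent components.
rng = np.random.default_rng(0)
p, n, k = 500, 60, 3
X = rng.laplace(size=(p, k)) @ rng.normal(size=(k, n)) + 0.1 * rng.normal(size=(p, n))

S, A = ica_signal_matrix(X, k)
modules = module_genes(S)
eig = eigenmodule(X, max(modules, key=len))  # summarize the largest module

mortality = rng.integers(0, 2, size=n)       # hypothetical binary outcome
covariates = rng.normal(size=(n, 3))         # placeholders for age, sex, cell-proportion PCs
beta = fit_eigenmodule_model(eig, mortality, covariates)
```

Because a gene can exceed the cutoff in several columns of `S`, the same gene may appear in more than one module, which is the property that distinguishes this approach from hard clustering. The sign of an eigenmodule is arbitrary, so the sign of the mortality coefficient is only meaningful relative to the sign of the module gene loadings.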