### Identifying gene expression modules using an ICA matrix decomposition

#### Motivation 
It was recently reported that there are approximately 11 million deaths due to sepsis annually (Rudd et al. 2020). Most of these deaths occur in the ICU, where patients succumb despite intensive care aimed at preventing organ dysfunction and failure (Sakr et al. 2018). It is well known that exaggerated innate immune responses, notably a massive cytokine storm-driven hyper-inflammation, result in early mortality in hospital patients with sepsis (Rittirsch et al. 2008; Hotchkiss et al. 2013a). At the same time, dysregulation of compensatory anti-inflammatory mechanisms and a global down-regulation of various cytokines (i.e., immunosuppression) occur, also contributing to mortality if these responses dominate (Boomer et al. 2011). However, the interplay of these mechanisms is not well understood, and importantly, the specific molecules dysregulated are unknown. Moreover, transcriptomic studies that characterize the global gene expression profiles of patients who succumb to sepsis are lacking. 

The typical analysis performed on gene expression datasets is differential expression (DE) analysis, where a generalized linear model (implemented in packages like DESeq2, limma, and edgeR) is used to test the association between the expression of one gene at a time and a phenotype of interest. Because each gene is tested in isolation, genes (in part representative of their protein products) that function in units, such as co-expressed or co-regulated genes, may not be uncovered. These gene units, or modules, have been hypothesized to characterize complex disease processes accurately, and importantly, they do so in the context of functionally related genes (Chaussabel and Baldwin 2014; Saelens et al. 2018). 

#### Methods 
##### Discovery Cohort
Here I describe the identification of mortality-related gene sets/signatures from the gene expression profiles of severely ill ICU patients. The ICU patients were recruited from a hospital in Toronto, Canada, within the first day of admission (except for two patients recruited from the hospital ward) (Table 3.1). These patients exhibited severe respiratory failure and/or suspected pulmonary sepsis. Patients were excluded if death was impending (within 12 hours), if blood collection was unattainable, or if consent was withheld. Of the enrolled patients, 27 (32.9%) were confirmed to be infected with SARS-CoV-2 by subsequent viral PCR. Furthermore, this cohort displayed severe symptomatology (SOFA score of 7.2 ± 0.55 at 24 h post-sampling) and high mortality (24.4% [20/82]). 

In [4]:
# Load required packages 
library(tidyverse)
library(magrittr)
library(fastICA) # Package with the fastICA algorithm 
library(functionjunction) # My package for data analysis 
library(enrichR)


In [5]:
# Read in data 
icu_dat <- read_rds("../create_tr_te/tr_te_dat.RDS")[["icu"]]
expr  <- icu_dat$expr
meta <- icu_dat$meta
all(colnames(expr)[-1] == meta$sample_identifier)

# Read in gene names
gene_names <- read_rds("../../../sepsis_rnaseq_all/final/counts_meta_10_read_filt_261120.RDS")$universe

In [6]:
# Let's get some basic info  
dim(expr)
functionjunction::first_five(expr)
table(meta$mortality) 

|   | ensembl_gene_id | sepcv001T1 | sepcv002T0 | sepcv003T0 | sepcv004T0 |
|---|---|---|---|---|---|
|   | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | ENSG00000000419 | 706 | 216 | 384 | 495 |
| 2 | ENSG00000000457 | 621 | 181 | 304 | 504 |
| 3 | ENSG00000000460 | 100 | 18 | 74 | 78 |
| 4 | ENSG00000000938 | 14387 | 10431 | 18687 | 21591 |
| 5 | ENSG00000000971 | 61 | 15 | 11 | 16 |



survive    dead 
     60      20 

In [7]:
# RNA-seq counts must be transformed prior to analysis because count data are heteroskedastic (the variance depends on the mean count)
# My go-to is the variance stabilizing transformation from DESeq2. Other options exist, like TPM, FPKM, RPKM, rlog, etc. 
expr_vst <- expr %>% 
  column_to_rownames(var = "ensembl_gene_id") %>% 
  as.matrix() %>% 
  DESeq2::varianceStabilizingTransformation() 

converting counts to integer mode
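
To make the heteroskedasticity point concrete, here is a minimal sketch on simulated counts (not the ICU data). It assumes negative-binomial counts with a dispersion of 0.1, and it uses `log2(x + 1)` as a simple stand-in for a variance-stabilizing transform; it is not the DESeq2 VST itself, and the variable names are illustrative.

```r
# Simulated example: variance of raw counts grows with the mean,
# while a log-type transform brings the variances onto a similar scale.
set.seed(1)
n_samples <- 2000
size <- 10  # negative-binomial size = 1 / dispersion (dispersion 0.1 is an assumption)

low  <- rnbinom(n_samples, mu = 10,    size = size)  # lowly expressed gene
high <- rnbinom(n_samples, mu = 10000, size = size)  # highly expressed gene

# Ratio of variances (high-expression gene vs low-expression gene)
raw_ratio <- var(high) / var(low)                        # raw count scale
log_ratio <- var(log2(high + 1)) / var(log2(low + 1))    # after log2(x + 1)

cat("Variance ratio, raw counts: ", signif(raw_ratio, 3), "\n")
cat("Variance ratio, log2(x + 1):", signif(log_ratio, 3), "\n")
```

On the raw scale the highly expressed gene dominates the total variance by several orders of magnitude, which would otherwise bias PCA and ICA towards a handful of abundant transcripts.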



##### ICA 

There are several statistical techniques to identify modules from a gene expression matrix, which mostly rely on unsupervised learning techniques such as correlation networks (e.g., Weighted Gene Co-expression Network Analysis/WGCNA), clustering (e.g., hierarchical, K-means), and matrix decomposition (e.g., PCA, Independent Component Analysis/ICA). Prior benchmarking studies concluded that ICA-derived modules outperformed those obtained by other methods when comparing inferred modules to known regulatory modules in E. coli, S. cerevisiae, and humans (Rotival et al. 2011; Saelens et al. 2018). 

In the context of gene expression analysis, ICA aims to uncover underlying biological processes (e.g., mechanisms mediating transcriptional regulation, signaling cascades, immune responses, etc.) represented by gene modules that yield the observed gene expression events. In other words, the observed gene expression matrix can be considered a net sum of unobserved or latent biological processes, which are inferred by the ICA matrix decomposition (briefly explained below). There are two notable advantages to this technique. Firstly, ICA allows a gene to participate in more than one module, unlike clustering and correlation network approaches, which assign each gene to exactly one cluster/module based on a set distance cut-off, an assumption that is biologically implausible. Secondly, the ICA-derived components are statistically independent by construction (or as independent as possible), which allows the components to map to distinct biological processes influencing gene expression. These components are reminiscent of the principal components obtained using the related method PCA, although principal components are only required to be uncorrelated, not statistically independent, and thus may fail to map to independent biological processes. 

$$
X_{ij} = \sum_{k=1}^{K} S_{ik} A_{kj} + \epsilon_{ij}
$$
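
The decomposition above can be sketched on toy data. This is a minimal illustration of the generative model only (it simulates $S$ and $A$ rather than estimating them), with made-up dimensions and loadings; it shows how one gene can carry a strong loading on more than one component.

```r
# Toy version of X = S A + epsilon with p genes, n samples, K components.
set.seed(1)
p <- 200; n <- 30; k <- 3

# S: gene-by-component loadings, near zero for most genes
S <- matrix(rnorm(p * k, sd = 0.1), nrow = p, ncol = k)
S[1:20, 1]  <- S[1:20, 1] + 3    # module 1: genes 1-20
S[11:30, 2] <- S[11:30, 2] + 3   # module 2: genes 11-30 (genes 11-20 are shared)
S[41:60, 3] <- S[41:60, 3] - 3   # module 3: genes 41-60, negative loadings

# A: component-by-sample activities
A <- matrix(rnorm(k * n), nrow = k, ncol = n)

# Observed expression = sum over components + noise
eps <- matrix(rnorm(p * n, sd = 0.05), nrow = p, ncol = n)
X <- S %*% A + eps

# Genes with strong loadings on both component 1 and component 2
shared <- which(abs(S[, 1]) > 1.5 & abs(S[, 2]) > 1.5)

dim(X)   # p x n
shared   # indices of genes belonging to two modules
```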


I applied a pipeline previously used by Teschendorff et al. (2007) and Rotival et al. (2011), which employs the FastICA algorithm to identify components (Hyvärinen and Oja 2000). Briefly, the input to the FastICA algorithm is a p × n gene expression matrix X (where p represents genes and n represents patients/observations). Among the outputs of the decomposition is a p × k “signal” matrix S, whose k columns represent a set of statistically independent “components” (or random variables) describing the activation (or contribution) of each of the p genes in each component. This approach was used to identify 37 components representing independent biological processes. 

Each ICA component was reduced to a set of module genes that strongly influenced the component distribution (i.e., the genes at the extremes of the distribution), as assessed by false discovery rate (FDR) estimates (Strimmer 2008; FDR < $10^{-3}$). An “eigenmodule”, defined as the first principal component of the module gene expression matrix, was derived to summarize module gene expression into a single variable. This approach was inspired by the WGCNA method, which, upon identifying sets of highly correlated genes (modules), summarizes module gene expression in the same fashion (Langfelder and Horvath 2008). 

A multiple linear regression was performed to estimate the association of each eigenmodule with eventual mortality, wherein the response variable was the eigenmodule, and covariates included binarized mortality status, age, sex, and transformed cell proportions (using PCA). Cell proportions were included in the linear model in case modules captured gene sets composed of correlated genes expressed by a specific cell type. Functional characterization of DE and module genes was performed using an overrepresentation analysis of Reactome and MSigDB Hallmarks (adjusted p-value ≤ 0.05) (Fabregat et al. 2018; Liberzon et al. 2015). 
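
The eigenmodule and regression steps can be illustrated with a small simulated example. This is a sketch under stated assumptions: the data, the effect size, and the column names (`meta_toy`, `mortality`, `age`, `sex`) are hypothetical, and cell-proportion covariates are left out for brevity.

```r
# Toy eigenmodule: first principal component of a module's expression matrix,
# then a linear model of the eigenmodule on mortality plus covariates.
set.seed(1)
n <- 80

# Hypothetical sample metadata
meta_toy <- data.frame(
  sample_identifier = paste0("s", seq_len(n)),
  mortality = factor(rep(c("survive", "dead"), times = c(60, 20)),
                     levels = c("survive", "dead")),
  age = round(runif(n, 30, 85)),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Simulated module of 25 genes sharing one latent activity,
# which is shifted upwards in non-survivors (assumed effect)
latent <- rnorm(n) + 1.5 * (meta_toy$mortality == "dead")
module_expr <- sapply(seq_len(25), function(g) latent + rnorm(n, sd = 0.5))
rownames(module_expr) <- meta_toy$sample_identifier  # samples x genes

# Eigenmodule = PC1 scores of the samples-by-genes module matrix
pca <- prcomp(module_expr, center = TRUE, scale. = TRUE)
meta_toy$eigenmodule <- pca$x[, "PC1"]

# Response is the eigenmodule; mortality is the term of interest
fit <- lm(eigenmodule ~ mortality + age + sex, data = meta_toy)
coef_table <- summary(fit)$coefficients
coef_table
```

The sign of a principal component is arbitrary, so the direction of the mortality coefficient is only interpretable after checking how PC1 correlates with the module genes; the p-value is unaffected by the sign.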

In [8]:
# Write the function to perform ICA 
perform_ica <- function(expr_mat, kurtosis_keep = 10, qval_filt = 0.001, seed = 1) {
  
  # Determine the number of components to learn using PCA
  # Specifically, how many PCs explain 95% of variance in the expression matrix?  
  comp_no <- expr_mat %>% 
    t() %>% 
    as.data.frame() %>% 
    remove_zero_var_cols() %>% 
    perform_pca() %>% 
    pluck("pov") %>% 
    explain_95(perc = 95)
  cat(paste0("Performing ICA with ", comp_no, " components.\n"))
  
  # Perform ICA
  set.seed(seed)
  ICA <- fastICA::fastICA(expr_mat, n.comp = comp_no)
  cat("ICA Done.\n")
  
  # Get the S matrix and rename the components as Modules
  ICA_S <- ICA$S %>% 
    as.data.frame() %>% 
    set_names(paste0("Mod_",   str_pad(1:ncol(.), width = 3, side = "left", pad = "0")    ))

  # Keep only components with high kurtosis. High kurtosis indicates a heavy-tailed distribution (a deviation from normality),
  # suggesting that a small set of genes at the extremes of the distribution drives the component
  comp_keep <- ICA_S %>% 
    map_dbl(~e1071::kurtosis(.x)) %>%  
    keep(~.x >= kurtosis_keep) %>% 
    names()
  ICA_S_filt <- ICA_S[, comp_keep, drop = FALSE]  # drop = FALSE keeps a data frame even if only one component passes
  
  # Filter components of interest with FDR estimation
  ICA_mods <- ICA_S_filt %>% 
    map(~fdrtool::fdrtool(.x, plot = FALSE, verbose = FALSE))
  
  # Filter genes in each component 
  ICA_mods_filt <- list()
  for (comp in names(ICA_mods)) {
    mod_df <- data.frame(qval = ICA_mods[[comp]][["qval"]], ensembl_gene_id = rownames(ICA_S))
    mod_df <- mod_df %>% 
      dplyr::filter(qval <= qval_filt) %>% 
      left_join(gene_names, by = "ensembl_gene_id")
    ICA_mods_filt[[comp]] <- mod_df
  }
  
  return(list(ICA_S_filt = ICA_S_filt, ICA_mods_filt = ICA_mods_filt))
}

# PCA 
perform_pca <- function(expr_mat){
  # Perform PCA with prcomp
  df <- expr_mat %>% 
    as.data.frame() %>% 
    prcomp(center = TRUE, scale. = TRUE)
  
  PoV <- round(df$sdev^2/sum(df$sdev^2), digits = 3)
  x <- df$x %>% 
    as.data.frame() %>% 
    rownames_to_column(var = "sample_identifier") 
  rot <- df$rotation %>% 
    as.data.frame() %>% 
    rownames_to_column(var = "ensembl_gene_id")
  
  res <- list(x = x, rot = rot,  pov = PoV)
  cat(paste0("PCA ---- DONE.\n"))
  return(res)
}

# Quick helper function to extract the number of PCs explaining X percent of variation in the data 
explain_95 <- function(ev, perc){
  for (ind in seq_along(ev)){
    var_expl <- sum(ev[1:ind])
    if(var_expl > perc/100){
      return(ind)
    }
  }
  # Fall back to all PCs if the (rounded) proportions never exceed the threshold
  return(length(ev))
}
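
As a quick check of the thresholding logic in `explain_95`, the same rule can be written with `cumsum`. This is a sketch of an equivalent vectorised helper; the name `n_pcs_for_variance` and the toy proportions are hypothetical and not part of the pipeline.

```r
# Number of leading PCs whose cumulative proportion of variance exceeds perc percent.
# Falls back to all PCs if the threshold is never exceeded.
n_pcs_for_variance <- function(pov, perc) {
  idx <- which(cumsum(pov) > perc / 100)
  if (length(idx) == 0) length(pov) else idx[1]
}

# Toy proportions of variance for four PCs (cumulative: 0.50, 0.80, 0.95, 1.00)
pov_toy <- c(0.50, 0.30, 0.15, 0.05)

n_pcs_for_variance(pov_toy, perc = 90)  # three PCs are needed to pass 90%
n_pcs_for_variance(pov_toy, perc = 75)  # two PCs are needed to pass 75%
```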

In [9]:
# Now run the ICA pipeline 
ICA_res <- perform_ica(
    expr_mat = expr_vst, 
    kurtosis_keep = 10, 
    qval_filt = 0.001, 
    seed = 1
)

PCA ---- DONE.
Performing ICA with 56 components.
ICA Done.
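
The kurtosis filter inside `perform_ica` can be illustrated on simulated loadings. This is a minimal sketch: the helper `excess_kurtosis` is a hand-written moment estimator used to keep the example dependency-free, so its values may differ slightly from `e1071::kurtosis`, and the two loading vectors are made up.

```r
# Excess kurtosis from central moments (0 for a Gaussian distribution)
excess_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

set.seed(1)
# Component with no module structure: Gaussian loadings for 5000 genes
gaussian_comp <- rnorm(5000)
# Component with a module: 50 of 5000 genes carry extreme loadings
module_comp <- c(rnorm(4950), rnorm(50, mean = 10))

k_gauss  <- excess_kurtosis(gaussian_comp)
k_module <- excess_kurtosis(module_comp)

cat("Gaussian component kurtosis:", round(k_gauss, 2), "\n")
cat("Module component kurtosis:  ", round(k_module, 2), "\n")

# With the cut-off used in the pipeline (kurtosis_keep = 10),
# only the heavy-tailed component would be retained
c(gaussian = k_gauss >= 10, module = k_module >= 10)
```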


In [10]:
# Now that I have modules, I need to determine whether the gene modules participate in established pathways/hallmarks using over-representation analysis. 
go_enrichment <- function(gene_list, p_val = 0.05, ID = "hgnc_symbol"){
  
  # Drop missing symbols first, then check whether any genes are left
  gene_list <- na.omit(gene_list)
  
  if(length(gene_list) == 0){
    return(NULL)
  }
  
  # Set res to NULL
  res <- NULL
  
  # Get results
  tryCatch({
    res <- enrichR::enrichr(genes = as.character(gene_list), 
                            databases = c("MSigDB_Hallmark_2020") 
    )
  }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")} )
  
  if (is.null(res)){
    return(NULL)
  } else {
    res <- res %>% map(~dplyr::filter(.x, Adjusted.P.value <= p_val))
    return(res)
  }
}
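
For intuition about what an over-representation test computes, here is a minimal local sketch using the hypergeometric distribution. The gene counts are invented, and this is not a reimplementation of Enrichr (which runs server-side against its own background); it only shows the underlying tail probability and its equivalence to a one-sided Fisher exact test.

```r
# Toy over-representation test: is a gene set over-represented in a module?
universe_size <- 20000  # genes in the background (assumed)
set_size      <- 200    # genes annotated to the hallmark/pathway (assumed)
module_size   <- 100    # genes in the ICA module (assumed)
overlap       <- 10     # module genes that are in the gene set (assumed)

# P(overlap >= observed) under random sampling without replacement
p_hyper <- phyper(overlap - 1, set_size, universe_size - set_size,
                  module_size, lower.tail = FALSE)

# Same test expressed as a one-sided Fisher exact test on a 2 x 2 table
contingency <- matrix(
  c(overlap,                set_size - overlap,
    module_size - overlap,  universe_size - set_size - module_size + overlap),
  nrow = 2,
  dimnames = list(in_set = c("yes", "no"), in_module = c("yes", "no"))
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

With an expected overlap of one gene by chance (100 × 200 / 20000), observing ten shared genes gives a very small p-value; in the pipeline, such p-values are then adjusted for multiple testing before the `Adjusted.P.value` filter is applied.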


In [11]:
# Perform mSigDB enrichment
sink()
ICA_mods_filt_msigdb_enr <-  ICA_res$ICA_mods_filt %>%
    map(~pull(.x, hgnc_symbol)) %>%
    map(~go_enrichment(.x,  p_val = 0.05))

Uploading data to Enrichr... Done.
  Querying MSigDB_Hallmark_2020... Done.
Parsing results... Done.

In [None]:
get_mod_eigenvect <- function(ICA_mods, expr_mat) {
  
  eigenmodule <- ICA_mods$ICA_mods_filt %>%
    purrr::discard(~nrow(.x) == 0) %>% 
    map(~as.matrix(expr_mat)[.x$ensembl_gene_id, , drop = FALSE]) %>%  # drop = FALSE keeps single-gene modules as matrices
    map(~t(.x) %>% perform_pca() %>% pluck("x")) %>%
    map(~dplyr::select(.x, one_of("sample_identifier", "PC1")))
  
  return(eigenmodule)
  
}

In [None]:
#### Get Eigenvectors
ICA_mod_eigen <- get_mod_eigenvect(ICA_res, expr_vst)

In [None]:
#### Associate module eigenvectors with the clinical variables of interest (vars_of_int)
lm_clin_eig <- function(ICA_eig, meta, vars_of_int, deconv = NULL, incl_cell_props = FALSE) {
  met <- meta
  
  # Check things are in the right order 
  right_order <- ICA_eig %>% 
    map(~all(.x$sample_identifier == met$sample_identifier)) %>% 
    unlist() %>% 
    all()
  
  if (isFALSE(right_order)) {
    stop("Samples in ICA_eig and meta are not in the same order")
  }
  
  # Design formula 
  des = "PC1 ~ "
  
  # Include cell proportions  
  if (incl_cell_props) { 
    
    if (is.null(deconv)) {
      stop("deconv must be supplied when incl_cell_props = TRUE")
    }
    
    deconv_pca <- deconv %>% 
      rownames_to_column(var = "sample_identifier") %>% 
      filter(sample_identifier %in% meta$sample_identifier) %>% 
      remove_zero_var_cols() %>% 
      column_to_rownames(var = "sample_identifier") %>% 
      perform_pca() 
    
    pcs_to_incl <- deconv_pca$pov %>% explain_95(perc = 90)
    cat(paste0("Cell Proportion PCs used: ", pcs_to_incl, "\n"))
    pcs_to_incl <- paste0("PC", 1:pcs_to_incl)
    
    deconv_pca_x <- deconv_pca$x %>% 
      dplyr::select(one_of("sample_identifier", pcs_to_incl)) %>% 
      dplyr::rename_at(vars(matches("PC")), ~ paste0("CiSo_", .))
    
    met <- met %>% 
      left_join(deconv_pca_x,  by = "sample_identifier")
    
    if (length(unique(met$sequencing_month_year)) > 1 ) {
      # Adjust for sequencing batch when more than one batch is present
      des <- paste0(des, "sequencing_month_year + ")
    }
    
    des <- paste0(des, "gender + age + ", paste(paste0("CiSo_",pcs_to_incl), collapse = " + "), " + ")
  }
  
  # Initiate loop - loop through variables, and then module
  res <- list()
  for (var in  vars_of_int) {
    for (mod in names(ICA_eig)) {
      # Create a new df with the var and eigenvector the module
      lm_df <- ICA_eig[[mod]] %>% 
        left_join(met, by = "sample_identifier")
      
      # GLM - PC1 ~ var
      res[[var]][[mod]] <- glm(formula = formula(paste0(des,  var)), data = lm_df ) %>% 
        broom::tidy()
    }
  }

  return(list(res = res))
} 

In [None]:
deconv_res <- read_rds("./create_tr_te/deconv_res.rds") %>% 
  map(~map(.x, ~as.data.frame(.x)))
deconv_res <- deconv_res$icu$cibersort

ICA_mod_eigen_clin <- lm_clin_eig(ICA_mod_eigen, meta, vars, deconv_res, TRUE) 

In [None]:
# Get MsigDB hallmarks for each module
msigdb = ICA_mod_plot$expr_uncorr$msigdb[ICA_mod_expr_dir$comp]

ICA_mods_filt_msigdb_enr_df <- msigdb %>%
  map(~keep(.x, names(.) %in% c("MSigDB_Hallmark_2020"))) %>%
  map(~bind_rows(.x, .id = "database")) %>%
  bind_rows(.id = "comp") %>% 
  separate(Overlap,into = c("M", "N")) %>%
  mutate(Ratio = as.numeric(M)/as.numeric(N)*100 ) %>%
  mutate(Term = case_when(Term == "heme Metabolism" ~ "Heme Metabolism", TRUE ~ Term))  %>% 
  group_by(comp) %>%
  top_n(2, Ratio) %>%
  ungroup() %>% 
  dplyr::select(one_of("comp", "Term", "Ratio")) %>% 
  right_join(data.frame(comp = names(msigdb)), by = "comp") %>% 
  #right_join(data.frame(comp = ICA_mod_expr_dir$comp  )) %>% 
  mutate(Ratio = ifelse(is.na(Ratio), 0, Ratio)) %>% 
  mutate(Term = ifelse(is.na(Term), "None", Term)) %>% 
  pivot_wider(id_cols = "Term", names_from = "comp", values_from = "Ratio") %>% 
  na2zero() %>% 
  filter(Term != "None") %>% 
  mutate(database = "mSigDB")


In [None]:
# Get the clinical association between each module and an outcome   
ICA_mod_eigen_clin_df <- ICA_mod_eigen_clin$expr_uncorr$res %>%
  map(~bind_rows(.x, .id = "comp")) %>%
  bind_rows(.id = "variable") %>%
  filter(term != "(Intercept)" ) %>% 
  mutate(adj_p_value = p.adjust(p.value, method = "BH")) %>% 
  filter(str_detect(term,  paste(vars, collapse = "|") )) %>% 
  mutate(p_val_bin = case_when(
    adj_p_value >0.05 ~ "P>0.05", 
    adj_p_value >0.01 ~ "P>0.01",
    adj_p_value >0.001 ~ "P>0.001",
    adj_p_value >0.0001 ~ "P>0.0001", 
    TRUE ~ "P<0.0001")
    ) %>% 
  mutate(log10_p.value = -1*log10(p.value)) %>% 
  dplyr::select(one_of("variable", "comp", "log10_p.value", "p_val_bin", "p.value")) %>% 
  pivot_wider(id_cols = "comp",names_from = "variable", values_from = "p_val_bin") %>% 
  #arrange(match(comp, colnames(ICA_mods_filt_df)[-c(1:2)] ))
  arrange(match(comp, ICA_mod_expr_dir$comp))

#### Start making the heatmap
all(colnames(ICA_mods_filt_df)[-c(1:2)] == ICA_mod_eigen_clin_df$comp)
colnames(ICA_mods_filt_df) <- str_replace(colnames(ICA_mods_filt_df), "_", " ")

library(ComplexHeatmap)
#col_fun <- circlize::colorRamp2(c(0,1), c( "white", "#330033"))
col_fun = structure(c( "gray95", "#330033"), names = c( "Not Enr.", "Enr. Hallmark"))
col_assoc = c("P>0.05" = "grey60", 
              "P>0.01" = "darkgoldenrod1",
              "P>0.001"= "darkmagenta",
              "P>0.0001"= "mediumblue",
              "P<0.0001"= "midnightblue"
              ) 
col_dir = c("Up" = "firebrick" ,"Down"="darkgreen", "Mixed" = "blue")
column_ha <- HeatmapAnnotation(
  #Culture = ICA_mod_eigen_clin_df$culture,
  #"Organ Dysf. Group" = ICA_mod_eigen_clin_df$sofa_24,
  "Eigenmod/Mortality Assoc." = ICA_mod_eigen_clin_df$mortality,
  "Module Expr Direction" = ICA_mod_expr_dir$Direction,
  col = list(
    #Culture = col_assoc,
    #"Organ Dysf. Group" = col_assoc,
    "Eigenmod/Mortality Assoc." = col_assoc,
    "Module Expr Direction" = col_dir),
  annotation_legend_param = list(
    # Culture = list(
    #   title = "Eigenmod/\nEndpoint Assoc. \n-log10(AdjPValue)"),
    #"Organ Dysf. Group" = list(
    #  title = "Eigenmod/\nEndpoint Assoc. \n-log10(AdjPValue)"),
    "Eigenmod/Mortality Assoc." = list(
      title = "Eigenmod/\nEndpoint Assoc. \n-log10(AdjPValue)"
      #labels = c("P=1", "P=0.05", "P=0.01","P=0.001","P=0.0001" )
      #at = c(0,1.3,2,3,4)
      ),
    "Module Expr Direction" = list(
      title = "Module Expr Direction"
    )
  ),
  border = TRUE,
  #annotation_width=unit(c(2, 5.0), "cm"),
  simple_anno_size = unit(1, "cm"), 
  gap = unit(1, 'mm'),
  show_legend = TRUE  # a single value is recycled across the two annotations
  )

htmap <- ICA_mods_filt_df %>% 
  filter(database == "mSigDB") %>% 
  dplyr::select(-database) %>% 
  column_to_rownames(var = "Term") %>% 
  as.matrix() %>% 
  Heatmap(
    name = "\nRatio",
    column_split = ICA_mod_expr_dir$Direction,
    cluster_rows = FALSE,
    cluster_columns = FALSE,
    top_annotation = column_ha,
    col = col_fun,
    border = TRUE
    )