# Hidden Factor Analysis

## Overview

This workflow implements three procedures for hidden factor analysis of omics data:

1. The [Probabilistic Estimation of Expression Residuals (PEER) method](https://github.com/PMBio/peer/wiki/Tutorial), a method also used for GTEx eQTL data analysis. 
2. Factor analysis with bi-cross-validation (Owen, Art & Wang, Jingshu (2015). Bi-Cross-Validation for Factor Analysis. Statistical Science, 31. doi: 10.1214/15-STS539), using the software package `APEX` (Corbin Quick, Li Guan, Zilin Li, Xihao Li, Rounak Dey, Yaowu Liu, Laura Scott, Xihong Lin, bioRxiv 2020.12.18.423490; doi: https://doi.org/10.1101/2020.12.18.423490).
3. PCA with automatic determination of the number of factors to use. This is mainly inspired by a [recent benchmark from Jessica Li's group](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02761-4).


We use the PCA-based approach for the xQTL project. Single-cell eQTL analysis calls for additional considerations, as investigated in [this paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02873-5).

## Input

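1. `protocol_example.protein.bed.gz`: the molecular phenotype data (`--phenoFile`). The first four columns describe each feature and the remaining columns are samples.
2. `protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz`: the merged covariates file (`--covFile`). The first column is `#id` and the remaining columns are samples.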

## Output

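1. `protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.residual.bed.gz`: the phenotype residuals after regressing out the merged covariates.
2. `protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz`: the merged covariates with the selected hidden factors appended as rows named `Hidden_Factor_PC1`, `Hidden_Factor_PC2`, and so on.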
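
The long output names follow from the workflow's `--name` default, `f'{phenoFile:bnn}.{covFile:bn}'`, and from the `{_input:bnnn}` specifier used by `Marchenko_PC_2`. The sketch below mimics that naming in plain Python, assuming that `b` takes the basename and each `n` strips one extension. `strip_extensions` and `default_name` are hypothetical helpers written for illustration, not part of SoS.

```python
from pathlib import Path

def strip_extensions(path, n):
    """Return the basename of `path` with its last `n` extensions removed."""
    name = Path(path).name
    for _ in range(n):
        name = Path(name).stem
    return name

def default_name(pheno_file, cov_file):
    # Assumed equivalent of f'{phenoFile:bnn}.{covFile:bn}' in the workflow.
    return f"{strip_extensions(pheno_file, 2)}.{strip_extensions(cov_file, 1)}"

pheno = "output/phenotype/protocol_example.protein.bed.gz"
cov = ("output/covariate/protocol_example.samples.protocol_example."
       "genotype.chr21_22.pQTL.plink_qc.prune.pca.gz")

name = default_name(pheno, cov)
residual_file = f"{name}.residual.bed.gz"
# Assumed equivalent of f'{_input:bnnn}.Marchenko_PC.gz' in Marchenko_PC_2.
factor_file = f"{strip_extensions(residual_file, 3)}.Marchenko_PC.gz"

print(residual_file)
print(factor_file)
```

With the two input paths from the minimal working example, the printed names match the two files listed under Output.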

## Minimal Working Example

The proteomics data used in this MWE are available on [Synapse](https://www.synapse.org/#!Synapse:syn52369482).

### Step 1: Compute Residual on Merged Covariates and Perform Hidden Factor Analysis
Timing: < 1 minute

In [None]:
sos run pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
   --cwd output/covariate \
   --phenoFile output/phenotype/protocol_example.protein.bed.gz  \
   --covFile output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz \
   --mean-impute-missing \
   --container singularity/PCAtools.sif

```
INFO: Running computing residual on merged covariates:
INFO: computing residual on merged covariates is completed.
INFO: computing residual on merged covariates output:   output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.residual.bed.gz
INFO: Running Marchenko_PC_2:
INFO: Marchenko_PC_2 is completed.
INFO: Marchenko_PC_2 output:   output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz
INFO: Workflow Marchenko_PC (ID=wb2797da52c726752) is executed successfully with 2 completed steps.
```
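After the run, a quick way to check the result is to list the hidden-factor rows in the `Marchenko_PC.gz` file. The sketch below assumes the layout written by `Marchenko_PC_2`: a gzipped, tab-delimited table whose first column is `#id` and whose factor rows are named `Hidden_Factor_PC1`, `Hidden_Factor_PC2`, and so on. The function `hidden_factor_ids`, the toy file, the covariate row `PC1`, and all values are made up for illustration.

```python
import csv
import gzip
import os
import tempfile

def hidden_factor_ids(path, prefix="Hidden_Factor_PC"):
    """Return the '#id' values of rows whose ID starts with `prefix`."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["#id"] for row in reader if row["#id"].startswith(prefix)]

# Toy stand-in for the real output: one covariate row plus two factor rows.
toy_rows = [
    ["#id", "sample_1", "sample_2", "sample_3"],
    ["PC1", "0.10", "-0.20", "0.10"],
    ["Hidden_Factor_PC1", "1.50", "-0.70", "-0.80"],
    ["Hidden_Factor_PC2", "-0.30", "0.90", "-0.60"],
]
toy_path = os.path.join(tempfile.mkdtemp(), "toy.Marchenko_PC.gz")
with gzip.open(toy_path, "wt", newline="") as handle:
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(toy_rows)

factors = hidden_factor_ids(toy_path)
print(len(factors), "hidden factors:", factors)
```

To inspect a real run, point `hidden_factor_ids` at the `Marchenko_PC.gz` path reported in the log above.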

## Command interface

In [None]:
sos run pipeline/covariate_hidden_factor.ipynb -h

In [None]:
usage: sos run pipeline/covariate_hidden_factor.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  Marchenko_PC
  PEER
  BiCV

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files. MUST BE FULL
                        PATH
  --covFile VAL (as path, required)
                        Merged Covariates File
  --phenoFile VAL (as path, required)
                        Path to the input molecular phenotype data.
  --name  f'{phenoFile:bnn}.{covFile:bn}'

  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
                        Software container option
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""


Sections
  *_1:
    Workflow Options:
      --[no-]mean-impute-missing (default to False)
  Marchenko_PC_2:
  PEER_2:
    Workflow Options:
      --N 0 (as int)
                        N PEER factors, If do not specify or specified as 0,
                        default values suggested by GTEx (based on different
                        sample size) will be used
      --iteration 1000 (as int)
                        Default values from PEER software: The number of max
                        iteration
      --tol 0.001 (as float)
                        Prior parameters parameter: Alpha_a = 0.001 parameter:
                        Alpha_b = 0.1 parameter: Eps_a = 0.1 parameter: Eps_b =
                        10.0 Tolarance parameters
      --[no-]r2-tol (default to False)
                        parameter: var_tol = 0.00001 minimum variance explained
                        criteria to drop factors while training
      --convergence-mode fast
                        Convergence mode: Convergence mode for MOFAr "slow",
                        "medium" or "fast", corresponding to 1e-5%, 1e-4% or
                        1e-3% deltaELBO change.
  PEER_3:
  BiCV_2:
  BiCV_3:
    Workflow Options:
      --N 0 (as int)
                        N factors, if not specify, calculated based on sample
                        size according to GTeX
      --iteration 10 (as int)
                        The number of iteration
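
The `--entrypoint` default in the help text above is a Python expression that derives a micromamba environment name from the `--container` value. The sketch below wraps that same expression in a function so its result can be inspected. `default_entrypoint` is a hypothetical name used only for this illustration.

```python
import re

def default_entrypoint(container):
    """Evaluate the --entrypoint default expression for a container value."""
    if not container:
        return ""
    # Take the file name and drop a trailing image suffix.
    env_name = re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '',
                      container.split('/')[-1])
    return 'micromamba run -a "" -n' + ' ' + env_name

print(default_entrypoint("singularity/PCAtools.sif"))
print(repr(default_entrypoint("")))
```

For the container used in the minimal working example, `singularity/PCAtools.sif`, this yields `micromamba run -a "" -n PCAtools`. An empty `--container` yields an empty entrypoint.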

### Step 1: Compute Residual on Merged Covariates and Perform Hidden Factor Analysis

In [None]:
[*_1(computing residual on merged covariates)]
parameter: mean_impute_missing = False
input: phenoFile, covFile
output: f'{cwd}/{name}.residual.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout' , container = container, entrypoint = entrypoint

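    # Replace missing values in each column with that column's mean.
    # It is called on the transposed phenotype matrix (samples x features), so columns are features.
    mean_impute <- function(d){
      f <- apply(d, 2, function(x) mean(x, na.rm = TRUE))
      for (i in seq_along(f)) d[,i][which(is.na(d[,i]))] <- f[i]
      return(d)
    }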
  
    library(dplyr)
    library(readr)
    pheno = read_delim(${_input[0]:r},delim = "\t")
    covariate= read_delim(${_input[1]:r},delim = "\t") 

    # Keep only samples present in both files. Samples removed upstream (e.g. outliers) no longer appear in the covariate file header, so they are dropped here.
    extraction_sample_list <- intersect(colnames(pheno), colnames(covariate)) 
    
    if(length(extraction_sample_list) == 0){
      stop("No samples overlap between the two files!")
    }
    
    ## Report sample counts:
    print(paste((ncol(pheno) - 4), "samples are in the phenotype file", sep = " "))
    print(paste((ncol(covariate) - 1), "samples are in the covariate file", sep = " "))

    ## Report identical samples:
    print(paste(length(extraction_sample_list), "samples overlap between phenotype & covariate files and are included in the analysis:", sep = " "))
    print(extraction_sample_list)

    ## Report non-overlapping samples :
    covariate_missing = covariate %>% select(-all_of('#id')) %>% select(-all_of(extraction_sample_list))
    print(paste(ncol(covariate_missing), "samples in the covariate file are missing from the phenotype file:", sep = " "))
    print(colnames(covariate_missing))

    pheno_missing = pheno %>% select(-c(1:4)) %>% select(-all_of(extraction_sample_list))
    print(paste(ncol(pheno_missing), "samples in the phenotype file are missing from the covariate file:", sep = " "))
    print(colnames(pheno_missing))

    # Subset the data:
    covariate = covariate[,extraction_sample_list]%>%as.matrix()%>%t()
    pheno_id = pheno%>%select(1:4)
    pheno = pheno%>%select(all_of(rownames(covariate)))%>%as.matrix()%>%t()
    if (${"T" if mean_impute_missing else "F"}) {
      pheno = mean_impute(pheno)
    } else {
      if(sum(is.na(pheno)) > 0){ stop("NA in phenotype input is not allowed!") }
    }
    # Get residual 
    pheno_resid = .lm.fit(x = cbind(1,covariate), y = pheno)$residuals
    pheno_output = cbind(pheno_id, pheno_resid%>%t())
    pheno_output%>%write_delim("${_output:n}",delim = "\t")
  
bash: expand = "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
    bgzip -f ${_output:n}
    tabix -p bed ${_output}

    stdout=${_output:n}.stdout 
    for i in ${_output} ; do 
    echo "output_info: $i " >> $stdout;
    echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
    echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
    echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
    echo "output_preview:"   >> $stdout;
    zcat $i | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
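
The two numerical operations in this step, per-feature mean imputation and regression of each feature on an intercept plus the covariates, can be sketched outside R as follows. This is a minimal pure-Python illustration under stated assumptions, not the workflow's code: `mean_impute`, `solve`, and `residualize` are hypothetical helpers, the toy values are made up, and a normal-equations solve stands in for the `.lm.fit` call used above.

```python
from statistics import fmean

def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mu = fmean(observed)
    return [mu if v is None else v for v in column]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def residualize(y, covariates):
    """OLS residuals of y on an intercept plus covariates (rows = samples)."""
    x = [[1.0] + list(row) for row in covariates]
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(x, y)]

# Toy data: one feature measured on five samples, with one covariate.
feature = [1.0, None, 3.0, 4.0, 7.0]
cov = [[0.0], [1.0], [2.0], [3.0], [4.0]]

imputed = mean_impute(feature)   # the missing value becomes the feature mean
resid = residualize(imputed, cov)
print(imputed)
print([round(r, 6) for r in resid])
```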

In [None]:
[Marchenko_PC_2]
output: f'{cwd}/{_input:bnnn}.Marchenko_PC.gz'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint = entrypoint
    library("dplyr")
    library("readr")
    library("PCAtools")
    library('BiocSingular')
    residExpPath  = "${_input}"
    covPath = "${covFile}"
    residExpDF <- read_delim(residExpPath, show_col_types=FALSE)
    covDF <- read_delim(covPath, show_col_types=FALSE)
    commonMPSamples <- intersect(colnames(covDF), colnames(residExpDF))
    covDFcommon <- cbind(covDF[, 1], covDF[, commonMPSamples])
  
    residExpPC <- pca(
        residExpDF[,commonMPSamples], # The first four columns are: chr, start, end, and gene_id; so we skip those.
        scale = TRUE,
        center = TRUE,
        BSPARAM = ExactParam())
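    # Standardize each feature (row) across samples, then pool all standardized values to estimate the noise variance.
    M <- apply(residExpDF[, commonMPSamples], 1, function(X){ (X - mean(X))/sd(X)});
    residSigma2 <- var(as.vector(M));
    print(paste('sigma2:', residSigma2))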
    
    MPPCNum <- chooseMarchenkoPastur(
        .dim = dim(residExpDF[, commonMPSamples]), var.explained=residExpPC$sdev^2, noise=residSigma2)
    
    # seq_len() and drop = FALSE keep a samples x factors table even when zero or one PC is retained.
    MPPCsDF <- as.data.frame(residExpPC$rotated[, seq_len(MPPCNum), drop = FALSE])

    colnames(MPPCsDF) <- paste0('Hidden_Factor_PC', seq_len(MPPCNum))
    # PCA was run on the commonMPSamples columns, so label the rows with those names in the same order.
    rownames(MPPCsDF) <- commonMPSamples
    # Add #id Column
    MPPCsDF <- as.data.frame(t(MPPCsDF))
    MPPCsDF$id <- rownames(MPPCsDF)
    MPPCsDF <- MPPCsDF %>% select(id, everything()) %>% rename("#id" = "id")
    write_delim((rbind(covDFcommon, MPPCsDF)), "${_output}", "\t")
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
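
The selection rule in `Marchenko_PC_2` can be sketched in a few lines. The idea is to keep the principal components whose variance exceeds the upper edge of the Marchenko-Pastur distribution for pure noise. The code below is an illustration of that idea under stated assumptions, not a copy of the PCAtools implementation: the function names are hypothetical, the degrees-of-freedom convention (`n_features - 1`) is an assumption about how `chooseMarchenkoPastur` scales the limit, and all numbers are made up.

```python
import math
from statistics import fmean, stdev, variance

def pooled_standardized_variance(rows):
    """Standardize each feature (row), pool all values, return their variance."""
    pooled = []
    for row in rows:
        mu, sd = fmean(row), stdev(row)
        pooled.extend((v - mu) / sd for v in row)
    return variance(pooled)

def choose_marchenko_pastur(n_features, n_samples, var_explained, noise):
    """Count components whose variance exceeds the Marchenko-Pastur upper edge."""
    ndf = n_features - 1                      # assumed degrees-of-freedom convention
    limit = noise * (1 + math.sqrt(n_samples / ndf)) ** 2
    kept = sum(v > limit for v in var_explained)
    return kept, limit

# Toy noise estimate from two features measured on three samples.
noise_toy = pooled_standardized_variance([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])

# Toy selection: 101 features, 25 samples, unit noise variance.
kept, limit = choose_marchenko_pastur(
    n_features=101, n_samples=25,
    var_explained=[10.0, 4.0, 2.5, 2.0, 1.0], noise=1.0)
print(noise_toy, kept, limit)
```

In the toy selection the limit is 2.25, so the first three components are kept. In the workflow, the retained components are the ones written out as `Hidden_Factor_PC` rows.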