# PEER factor analysis

## Overview:
* The baseline Docker image is `rticode/peer:1.3_8f2237f` (to be replaced by our own image).
* An R script that retrieves the default values of the priors and tolerances is in `Run_peer.R`.

## Input and output:
### Input: 
* --expression_file: The expression matrix, with G + 1 rows (a header plus one row per gene) and N + 1 columns (an ID column plus one column per sample). It can be obtained easily from the Dapars result file.

    * Note on orientation: PEER itself assumes a matrix with **N rows and G columns**, where N is the number of samples and G is the number of genes. Our input uses the transposed layout shown below (genes in rows, samples in columns). The conversion to the PEER format is done in `Run_peer_simple_version.R`.

In [1]:
df <- read.table("Peer_example_data/Peer_example_data.txt", row.names = 1, header = T)
df[1:5,1:5]

| | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Gene1 | -1.627455 | 1.684725 | -0.0807924 | 1.510522 | 0.5804513 |
| Gene2 | -12.626715 | 17.746163 | -14.2519447 | 5.254536 | 14.8258933 |
| Gene3 | 3.33739 | -7.259074 | 6.2643526 | -1.375911 | -4.7039364 |
| Gene4 | 4.468973 | -4.17925 | 6.2956787 | 1.316271 | -3.5877553 |
| Gene5 | -4.945527 | 8.378774 | -4.0786989 | 4.729963 | 6.007909 |
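
As a rough illustration of the orientation change, the genes-by-samples table can be flipped into the samples-by-genes layout that PEER expects. This is only a sketch in plain Python; the pipeline itself does the transposition in R with `t()`. The function name `to_peer_layout` and the toy values are made up for illustration.

```python
def to_peer_layout(gene_rows):
    """Transpose a genes x samples table into samples x genes.

    gene_rows: list of rows, one per gene, each a list of per-sample values.
    Returns a list of rows, one per sample.
    """
    if not gene_rows:
        return []
    n_samples = len(gene_rows[0])
    # Every gene must have a value for every sample.
    if any(len(row) != n_samples for row in gene_rows):
        raise ValueError("all gene rows must have the same number of samples")
    return [list(column) for column in zip(*gene_rows)]


# Toy input: 3 genes (rows) x 2 samples (columns).
genes_by_samples = [
    [1.0, 2.0],  # Gene1
    [3.0, 4.0],  # Gene2
    [5.0, 6.0],  # Gene3
]
samples_by_genes = to_peer_layout(genes_by_samples)
# samples_by_genes now has 2 rows (samples) and 3 columns (genes).
```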


* --covariate_file: A file with C + 1 rows and N + 1 columns, where N is the number of samples and C is the number of covariates. The samples must be in the same order as in the expression matrix.
* --N: Number of PEER factors to use. If N = 0, the number of factors is chosen automatically from the sample size, following the [GTEx](https://gtexportal.org/home/documentationPage) recommendation. To see the number actually used, check the `.stdout` file.
* Other parameters: the values of the priors and tolerances.
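
The automatic choice for `--N 0` can be sketched as a small function. The thresholds below are copied from the R code in the `[PEER]` step of this notebook; the function name `suggest_num_factors` is hypothetical, and the tiers in the GTEx documentation may differ between releases, so treat the numbers as this notebook's assumption.

```python
def suggest_num_factors(n_samples, requested=0):
    """Return the number of PEER factors to fit.

    A non-zero `requested` value is used as is; 0 triggers the
    sample-size rule coded in the [PEER] step of this notebook.
    """
    if requested != 0:
        return requested
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    return 35


# Example: 200 samples and no explicit request falls into the middle tier.
chosen = suggest_num_factors(200)
```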

### Output
Four result files (PEER factors, weights, precisions, and residuals) and a diagnosis plot.
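
The numeric outputs are written as tab-delimited tables whose first header cell is `#id` (see `WriteTable` in the `[PEER]` step). A minimal sketch of reading such a table back is given here, assuming the factor file layout of one row per PEER factor and one column per sample. The helper name `read_peer_table` and the example values are made up.

```python
import csv
import io


def read_peer_table(handle):
    """Parse a tab-delimited table whose first header cell is '#id'.

    Returns (column_names, rows), where rows maps each row name
    to its list of float values.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[0] != "#id":
        raise ValueError("expected the first header cell to be '#id'")
    columns = header[1:]
    rows = {}
    for record in reader:
        if not record:
            continue  # skip blank lines
        rows[record[0]] = [float(value) for value in record[1:]]
    return columns, rows


# Made-up example with 2 factors and 3 samples.
example = io.StringIO(
    "#id\tSample1\tSample2\tSample3\n"
    "peer.factor_1\t0.1\t-0.2\t0.3\n"
    "peer.factor_2\t1.5\t0.0\t-1.5\n"
)
samples, factors = read_peer_table(example)
```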

## Usage:
```sos
    sos run Call_PEER.ipynb call_peer \
        --expression_file Peer_example_data/Peer_example_data.txt \
        --covariate (optional) \
        --cwd /Users/albert29/Documents/Lab/Pipelines/Call_PEER/Output/ \
        --N 10
        ..
```

## Setup and global parameters:

In [2]:
[global]
# The output directory for generated files. MUST BE FULL PATH
parameter: wd = path
cwd = wd
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16384"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container_apex = str
parameter: name = str

# Number of PEER factors. If set to 0, the default suggested by
# GTEx (based on sample size) will be used
parameter: N = 4
n_of_factor = N

# Default values from PEER:
## Maximum number of iterations

parameter: max_iter = 1000
iteration = max_iter
## Prior parameters
parameter: Alpha_a = 0.001
parameter: Alpha_b = 0.1
parameter: Eps_a = 0.1
parameter: Eps_b = 10.
## Tolerance parameters
parameter: tol = 0.001
parameter: var_tol = 1e-08
# The molecular phenotype matrix
parameter: molecular_pheno = path
expression_file = molecular_pheno
# The covariate file
parameter: covariate = "None"

# VCF genotype list, so that apex factor can run with a covariate file as input. Requirement: the VCF must contain the same samples as the covariate and molecular phenotype files
parameter: genotype_list = path
import pandas as pd
import os
vcf_file = pd.read_csv(genotype_list,sep = "\t")["dir"][0]
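
The last line of the cell above implies that the genotype list is a tab-delimited file with a `dir` column, and that only the first entry of that column is used. A standard-library sketch of the same lookup is shown here; the helper name `first_vcf_path`, the `#chr` column, and the paths are hypothetical, since only the `dir` column is confirmed by the code above.

```python
import csv
import io


def first_vcf_path(handle):
    """Return the 'dir' value of the first data row in a tab-delimited list."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "dir" not in reader.fieldnames:
        raise ValueError("genotype list needs a 'dir' column")
    for row in reader:
        return row["dir"]  # only the first row is used
    raise ValueError("genotype list has no data rows")


# Hypothetical two-column list; '#chr' is an assumed extra column.
listing = io.StringIO(
    "#chr\tdir\n"
    "1\t/path/to/chr1.vcf.gz\n"
    "2\t/path/to/chr2.vcf.gz\n"
)
vcf_file = first_vcf_path(listing)
```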

In [3]:
sos run Call_PEER.ipynb -h

usage: sos run Call_PEER.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  call_peer

Global Workflow Options:
  --cwd VAL (as path, required)
                        The output directory for generated files. MUST BE FULL
                        PATH
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16384
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container-PEER 'rticode/peer:1.3_8f2237f'
                        Software container option

Sections
  call_peer:
    Workflow Options:
      --expre

## Main code for the pipeline: PEER

In [2]:
[PEER]

input: expression_file
output: f'{cwd}/{_input:bn}_diagnosis_plot.jpeg',
        f'{cwd}/{name}.PEER.cov'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container=container_apex, expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    
    library(peer)
  
    #### FUNCTIONS: ####  
      
    ## Adapted from the source code of package
    PEER_plotModel_adapted <- function(model){
        par(mfrow=c(2,1))
        bounds = PEER_getBounds(model)
        vars = PEER_getResidualVars(model)
        plot(bounds, type="l", col="red", lwd=2, xlab="Iterations", ylab="Lower bound")
        par(new=TRUE)
        plot(vars, type="l", col="blue", xaxt="n", yaxt="n", xlab="", ylab="")
        axis(4)
        mtext("Residual variance",side=4,line=3)
        legend("right",col=c("red","blue"),lty=1,legend=c("Lower bound","Residual variance"))
        alpha = PEER_getAlpha(model)
        plot(alpha,xlab="Factors",ylab="Inverse variance of factor weights", type="b", col="blue", lwd=4, xaxp=c(1,length(alpha), length(alpha)-1))
    }
    
    WriteTable <- function(data, filename, index.name) {
      datafile <- file(filename, open = "wt")
      on.exit(close(datafile))
      header <- c(index.name, colnames(data))
      writeLines(paste0(header, collapse = "\t"), con = datafile, sep = "\n")
      write.table(data, datafile, sep = "\t", col.names = F, quote = F)
    }
  
    #### MAIN ####
  
    # Start analysis:
    model <- PEER()
    cat("PEER: loading expression data ... ")
    #pe <- file.path(${_input:r})
    #df <- read.table(pe, header = T, row.names = 1)
    df <- readr::read_delim(${_input:r},"\t")
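    # Columns 1-4 are treated as annotation (they include gene_ID); sample columns start at column 5
    mtx = df[,5:ncol(df)]
    rownames(mtx) = df$gene_ID
    # Input: each column is a sample, each row is a gene
    # PEER requires each row to be a sample and each column to be a gene
    M <- t(as.matrix(mtx, rownames = T))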
    cat("done!\n")
    
    # Load covariate file (the global parameter is named `covariate`; "None" means no file):
    pc <- "${covariate}"
    if(pc != "None"){
        cat("PEER: loading covariate file ...")
        pc <- file.path(${covariate:r})
        cov.mat <- read.table(pc, header = T, row.names = 1)
        cov.mat <- t(as.matrix(cov.mat))
        
        if(dim(M)[1] != dim(cov.mat)[1]){
            cat("\n")
            stop("Expression file and covariate file do not have identical number of samples!")}
        cat("done!\n")
        cat("PEER: Input summary:",dim(M)[1], "samples with",dim(M)[2], "genes and", dim(cov.mat)[2],"covariates \n")
        invisible(PEER_setCovariates(model, cov.mat))
    }else{
        cov.mat <- NULL
        cat("PEER: Input summary:",dim(M)[1], "samples with",dim(M)[2], "genes \n")
    }
       
    # Suggest the number of factors to use if no input value
      
    if(${N} == 0){
      # Use suggestion
      if(dim(M)[1] < 150){
          num_factor = 15}else{
              if(dim(M)[1] < 250){num_factor = 30}else{num_factor = 35}
          }
    }else{num_factor = ${N}}
      
    # run PEER
    cat(paste0("Setting initialization parameters ..."))
    
    invisible(PEER_setNk(model, num_factor))
    invisible(PEER_setPhenoMean(model, M))
    invisible(PEER_setPriorAlpha(model,${Alpha_a},${Alpha_b}))
    invisible(PEER_setPriorEps(model,${Eps_a}, ${Eps_b}))
    invisible(PEER_setTolerance(model, ${tol}))
    invisible(PEER_setVarTolerance(model, ${var_tol}))
    invisible(PEER_setNmax_iterations(model, ${max_iter}))
    if(!is.null(cov.mat)){
        invisible(PEER_setCovariates(model, cov.mat))
    }
    cat("Done.\n")
  
    
    cat(paste0("PEER: estimating hidden confounders (", num_factor, ")\n"))
    time <- system.time(PEER_update(model))
    # add relevant row/column names
    
    factor.mat <- PEER_getX(model)  # samples x PEER factors
    weight.mat <- PEER_getW(model)  # omic features x PEER factors
    precision.mat <- PEER_getAlpha(model)  # PEER factors x 1
    resid.mat <- t(PEER_getResiduals(model))  # omic features x samples
    
    peer.var.names <- paste0("peer.factor_", 1:ncol(factor.mat))
    rownames(factor.mat) <- rownames(M)
    colnames(factor.mat) <- peer.var.names
    colnames(weight.mat) <- peer.var.names
    rownames(weight.mat) <- colnames(M)
    rownames(precision.mat) <- peer.var.names
    colnames(precision.mat) <- "alpha"
    precision.mat <- as.data.frame(precision.mat)
    precision.mat$relevance <- 1.0 / precision.mat$alpha
    rownames(resid.mat) <- colnames(M)
    colnames(resid.mat) <- rownames(M)
    
    cat("Exporting results ... ")
      
    # Diagnosis plot:
    output_path <- paste(${cwd:r},"/",${_input:bnr},"_diagnosis_plot.jpeg", sep = "")
    jpeg(output_path)
    invisible(PEER_plotModel_adapted(model))
    invisible(dev.off())
    
    # Write 4 numeric results (the basename must be quoted to be a valid R string)
    WriteTable(t(factor.mat), file.path("${_output[1]:d}", ${_output[1]:br}), "#id")
    WriteTable(weight.mat, file.path(${cwd:r}, "peer_weights.txt"), "#id")
    WriteTable(precision.mat, file.path(${cwd:r}, "peer_precisions.txt"), "#id")
    WriteTable(resid.mat, file.path(${cwd:r}, "peer_residuals.txt"), "#id")
    cat("Done.\n")
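
The R step above only compares the number of samples in the expression and covariate matrices. Since the input description asks for identical sample order, a stricter check on the sample names could look like the sketch below. It is not part of the pipeline, and the function name `check_sample_alignment` is hypothetical.

```python
def check_sample_alignment(expression_samples, covariate_samples):
    """Raise ValueError unless both sample lists match in content and order."""
    if len(expression_samples) != len(covariate_samples):
        raise ValueError(
            "expression and covariate files do not have the same number of samples"
        )
    for position, (expr_id, cov_id) in enumerate(
        zip(expression_samples, covariate_samples)
    ):
        if expr_id != cov_id:
            raise ValueError(
                f"sample order differs at position {position}: "
                f"{expr_id!r} vs {cov_id!r}"
            )
    return True


# Identical names in identical order pass the check.
aligned = check_sample_alignment(["Sample1", "Sample2"], ["Sample1", "Sample2"])
```

Running such a check before `PEER_setCovariates` would catch a silently permuted covariate file, which the count-only comparison cannot detect.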

## Main code for the pipeline: APEX

In [3]:
[APEX]
input:  molecular_pheno
output: f'{wd:a}/{name}.APEX.cov.gz',
        f'{wd:a}/{name}.APEX.cov'
task: trunk_workers = 1, trunk_size = 1, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container_apex,volumes = [f'{wd:ad}:{wd:ad}']
    apex factor \
    --out $[_output[0]:nn] \
    --iter $[iteration] \
    --factors $[n_of_factor] \
    --bed $[_input] \
    --vcf $[vcf_file] $[f'--cov {covariate}' if os.path.exists(covariate) else f'']

    gunzip -f -k $[_output[0]]
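
The `apex factor` call above adds `--cov` only when the covariate path exists on disk. The same logic can be sketched as a function that assembles the argument list; the flags are the ones used in the bash block, while the function name `build_apex_factor_args` and the file names are made up for illustration.

```python
import os


def build_apex_factor_args(out_prefix, iterations, n_factors, bed, vcf,
                           covariate="None"):
    """Assemble the `apex factor` argument list used in the [APEX] step."""
    args = [
        "apex", "factor",
        "--out", out_prefix,
        "--iter", str(iterations),
        "--factors", str(n_factors),
        "--bed", bed,
        "--vcf", vcf,
    ]
    # Mirror the notebook: pass --cov only if the covariate file exists.
    if os.path.exists(covariate):
        args += ["--cov", covariate]
    return args


# With the default covariate value "None" no --cov flag is expected
# (unless a file literally named "None" exists in the working directory).
args = build_apex_factor_args("out/demo.APEX", 1000, 4,
                              "pheno.bed.gz", "geno.vcf.gz")
```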

## Reference:
* The code is adapted from [here](https://github.com/RTIInternational/biocloud_docker_tools/blob/master/peer/v1.3/run_peer.R)
* The GTEx recommendation for the number of PEER factors is [here](https://gtexportal.org/home/documentationPage)
* Examples from PEER are on [github](https://github.com/PMBio/peer/wiki/Tutorial)