# Factor analysis using Bi-Cross validation
## Overview

This workflow performs factor analysis using the factor module of the [APEX toolkit](https://corbinq.github.io/apex/doc/mode_factor/).

## Input and output:

The formats of the main inputs and outputs are the same as those in the [PEER workflow](https://cumc.github.io/xqtl-pipeline/pipeline/data_preprocessing/covariate/PEER_factor.html):


### Inputs:

#### Required input:

* `--molecular_pheno`: The molecular phenotype (e.g. expression) file, used only for the factor analysis. Currently, `apex factor` assumes that the sample size is smaller than the number of molecular traits. Molecular trait data must be stored in an indexed `bed.gz` file. Note that you must supply the **compressed** and indexed version of the file (e.g. run `bgzip your_file.bed && tabix -p bed your_file.bed.gz`). An example `bed.gz` file looks like this:

In [40]:
readr::read_delim("Example_data/example_data.bed.gz",show_col_types= F)[1:3,1:8]

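| #chr | start | end | gene_ID | Sample1 | Sample2 | Sample3 | Sample4 |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 100 | Gene1 | -1.627455 | 1.684725 | -0.0807924 | 1.510522 |
| 1 | 101 | 200 | Gene2 | -12.626715 | 17.746163 | -14.2519447 | 5.254536 |
| 1 | 201 | 300 | Gene3 | 3.33739 | -7.259074 | 6.2643526 | -1.375911 |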


* `--name`: The name prefix of the output files. The factor file will be named `{name}.APEX.cov`.
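The layout and sample-size assumption described for `--molecular_pheno` can be checked before running the workflow. The following is a minimal sketch, not part of the pipeline: it assumes the file is tab-delimited with four annotation columns (`#chr`, `start`, `end`, `gene_ID`) followed by one column per sample, and the function names `summarize_bed` and `fewer_samples_than_traits` are hypothetical helpers introduced here for illustration.

```python
import csv
import gzip


def summarize_bed(path):
    """Return (n_samples, n_traits) for a tab-delimited, gzip-compressed BED file."""
    # bgzip output is gzip-compatible, so the standard gzip module can read it.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        # Assumption: 4 annotation columns come first; the rest are samples.
        if len(header) < 5:
            raise ValueError("expected 4 annotation columns followed by sample columns")
        n_samples = len(header) - 4
        # Every non-empty row after the header is one molecular trait.
        n_traits = sum(1 for row in reader if row)
    return n_samples, n_traits


def fewer_samples_than_traits(path):
    """True when the file has fewer samples than molecular traits."""
    n_samples, n_traits = summarize_bed(path)
    return n_samples < n_traits
```

Calling `fewer_samples_than_traits("your_file.bed.gz")` before submitting the job gives an early warning when the input does not satisfy the stated assumption.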

#### Optional inputs:

* `--covariate`: An optional covariate file. Sample names must match those in the molecular phenotype file exactly (names are case sensitive: `Sample1` is **not** equal to `sample1`). The covariates are not used in the calculation of the APEX factors (see [here](https://corbinq.github.io/apex/doc/mode_factor/) for details): the factors are estimated solely from the molecular phenotype data, and the result is then concatenated with the input covariate file. A sample covariate file is illustrated here:

In [1]:
readr::read_delim("Example_data/example_cov.txt",show_col_types= F)[,1:8]

| #id | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 | Sample7 |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Cov1 | 1.6243454 | -2.301539 | 1.462108 | -1.0998913 | -1.100619 | -0.6837279 | -0.6916608 |
| Cov2 | -0.6117564 | 1.744812 | -2.060141 | -0.1724282 | 1.144724 | -0.1228902 | -0.3967535 |


* `--N`: The number of latent factors to calculate. Default is 10.

* `--iteration`: The number of iterations to run. Default is 3. APEX does **not** recommend iterating until convergence.

* `--priorp` and `--priortau`: Other APEX defaults (0 and 1, respectively).
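Because sample names in the covariate file are matched case-sensitively, a name such as `sample2` is silently dropped when the phenotype file uses `Sample2`. The sketch below is one possible diagnostic, not part of the workflow; `case_only_mismatches` is a hypothetical helper that lists covariate sample names which differ from a phenotype sample name only by letter case.

```python
def case_only_mismatches(pheno_samples, cov_samples):
    """Return (cov_name, pheno_name) pairs that differ only by letter case."""
    # Map each lower-cased phenotype name to its original spelling.
    by_lower = {}
    for sample in pheno_samples:
        by_lower.setdefault(sample.lower(), sample)
    exact = set(pheno_samples)
    return [
        (sample, by_lower[sample.lower()])
        for sample in cov_samples
        # Exact matches are fine; only near-misses are reported.
        if sample not in exact and sample.lower() in by_lower
    ]


pheno = ["Sample1", "Sample2", "Sample3"]
cov = ["Sample1", "sample2", "Other"]
print(case_only_mismatches(pheno, cov))  # [('sample2', 'Sample2')]
```

Names with no counterpart at all (such as `Other` here) are not reported, since they are not case mismatches.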

### Outputs:

The outputs of this workflow are:

* Two files, `{name}.APEX.cov.gz` and its uncompressed copy `{name}.APEX.cov`, containing the factors estimated by `apex factor`

* A file `{name}.full_APEX.cov` for downstream analysis, in which the APEX factors are concatenated with the input covariates (if any)
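The output names above follow directly from the `--wd` and `--name` options. As a small sketch (the helpers `expected_outputs` and `missing_outputs` are hypothetical and not part of the workflow), the expected paths can be built and checked after a run like this:

```python
from pathlib import Path


def expected_outputs(wd, name):
    """Build the three output paths the workflow is documented to write."""
    wd = Path(wd)
    return {
        "factors_gz": wd / f"{name}.APEX.cov.gz",
        "factors": wd / f"{name}.APEX.cov",
        "combined": wd / f"{name}.full_APEX.cov",
    }


def missing_outputs(wd, name):
    """Return the expected output paths that do not exist on disk."""
    return [str(p) for p in expected_outputs(wd, name).values() if not p.exists()]
```

An empty list from `missing_outputs("./apex_out", "apex0")` would indicate that both steps wrote their files.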

## Usage:
```sos
    sos run BiCV_factor.ipynb APEX \
        --name NAME_HERE \
        --container_apex gouwh/apex:1.0.0 \
        --molecular_pheno PEER_example_data/Peer_example_data.bed.gz \
        --covariate PEER_example_data/Peer_example_cov.txt
        ...
```

In [1]:
[global]
# The output directory for generated files. MUST BE FULL PATH
parameter: wd = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container_apex = str
parameter: name = str

# Number of latent factors to estimate
parameter: N = 10
n_of_factor = N

# Number of iterations; the default value is 3.
parameter: iteration = 3

# The molecular phenotype matrix, in bed, after annotation
parameter: molecular_pheno = path
# The covariate file
parameter: covariate = "none"

# Other APEX defaults:
parameter: priorp = 0
parameter: priortau = 1

In [25]:
sos run BiCV_factor.ipynb -h

usage: sos run BiCV_factor.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  APEX
  APEXX

Global Workflow Options:
  --wd VAL (as path, required)
                        The output directory for generated files. MUST BE FULL
                        PATH
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16384
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container-apex VAL (as str, required)
                        Software container option
  --name VAL (as str, required)
  --N 10 (as int)
        

In [1]:
# APEX factor analysis main
[APEX_1]
input:  molecular_pheno
output: f'{wd}/{name}.APEX.cov.gz',
        f'{wd}/{name}.APEX.cov'
task: trunk_workers = 1, trunk_size = 1, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container_apex
    
    apex factor \
    --out $[_output[0]:nn] \
    --iter $[iteration] \
    --factors $[n_of_factor] \
    --bed $[_input] \
    --prior-p $[priorp] \
    --prior-tau $[priortau]

    gzip -dk $[_output[0]]

# Combine the covariates and the factors
[APEX_2]
input: f'{wd}/{name}.APEX.cov'
output:f'{wd}/{name}.full_APEX.cov'
task: trunk_workers = 1, trunk_size = 1, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output:bn}'
R: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container_apex
    
    # Function adapted from PEER
    WriteTable <- function(data, filename, index.name) {
      datafile <- file(filename, open = "wt")
      on.exit(close(datafile))
      header <- c(index.name, colnames(data))
      writeLines(paste0(header, collapse = "\t"), con = datafile, sep = "\n")
      write.table(data, datafile, sep = "\t", col.names = F, quote = F)
    }
  
    cov_impute <- read.delim($[_input:r], row.names = 1, check.names = F)
    
    pc <- "$[covariate]"
    
    if(pc == "none"){
      WriteTable(cov_impute, $[_output:r], "#id")
      }else{
      cov_origin <- read.delim(pc, row.names = 1, check.names = F)
      common_sample <- intersect(names(cov_origin), names(cov_impute))
      if(length(common_sample) == 0){
          stop("No common samples! ")
      }else{
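          # drop = FALSE keeps a data frame even when only one sample is shared
          cov_impute <- cov_impute[, common_sample, drop = FALSE]
          cov_origin <- cov_origin[, common_sample, drop = FALSE]
          # Log the covariates that are kept
          print(cov_origin)
          cov_out <- rbind(cov_impute, cov_origin)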
          WriteTable(cov_out, $[_output:r], "#id")
      }
    }
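The combining logic of the `APEX_2` step (intersect the sample names, subset both tables to the shared samples, then stack the factor rows on top of the covariate rows) can be sketched in plain Python. This is an illustration only, not the workflow's implementation: the table representation, the function name `combine_covariates`, and the row label `factor1` used in the usage note are assumptions, and row IDs are assumed to be distinct across the two tables.

```python
def combine_covariates(factors, covariates):
    """Stack factor rows on top of covariate rows for the shared samples.

    Each table is a pair (sample_names, rows), where rows maps a row ID
    to a list of values aligned with sample_names.
    """
    factor_samples, factor_rows = factors
    cov_samples, cov_rows = covariates
    in_factors = set(factor_samples)
    # Like R's intersect(), keep the order of the first argument (the covariate file).
    common = [s for s in cov_samples if s in in_factors]
    if not common:
        raise ValueError("No common samples!")

    def subset(samples, rows):
        # Reorder every row so its values line up with the shared samples.
        index = {s: i for i, s in enumerate(samples)}
        return {rid: [vals[index[s]] for s in common] for rid, vals in rows.items()}

    # Factor rows first, then covariate rows (dicts preserve insertion order).
    combined = subset(factor_samples, factor_rows)
    combined.update(subset(cov_samples, cov_rows))
    return common, combined
```

For example, with factors `(["S1", "S2", "S3"], {"factor1": [1, 2, 3]})` and covariates `(["S3", "S1", "X"], {"Cov1": [30, 10, 99]})`, the shared samples are `S3` and `S1`, and sample `X` is dropped because it has no factor values.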
    

## Minimum working example

In [45]:
sos run BiCV_factor.ipynb APEX \
    --wd ./apex_out \
    --name apex0 \
    --container_apex gouwh/apex:1.0.0 \
    --molecular_pheno Example_data/example_data.bed.gz \
    --iteration 3 \
    --covariate Example_data/example_cov.txt

INFO: Running [32mAPEX_1[0m: APEX factor analysis main
HINT: Pulling docker image gouwh/apex:1.0.0
HINT: Docker image gouwh/apex:1.0.0 is now up to date
INFO: [32mAPEX_1[0m is [32mcompleted[0m.
INFO: [32mAPEX_1[0m output:   [32mapex_out/apex0.APEX.cov.gz apex_out/apex0.APEX.cov[0m
INFO: Running [32mAPEX_2[0m: Combine the covaraite and the factor
INFO: [32mAPEX_2[0m is [32mcompleted[0m.
INFO: [32mAPEX_2[0m output:   [32mapex_out/apex0.full_APEX.cov[0m
INFO: Workflow APEX (ID=w1ce245c1631ad966) is executed successfully with 2 completed steps.


In [46]:
tree ./apex_out

./apex_out
├── apex0.APEX.cov
├── apex0.APEX.cov.gz
├── apex0.APEX.cov.gz.stderr
├── apex0.APEX.cov.gz.stdout
├── apex0.full_APEX.cov
├── apex0.full_APEX.cov.stderr
└── apex0.full_APEX.cov.stdout

0 directories, 7 files


In [47]:
cat ./apex_out/apex0.APEX.cov.gz.stdout

Using 4 threads.
1 present in file.

    Covariates can be specified using --cov FILE. Use --rankNormal
    to rank-normal (aka, inverse-normal) transform traits, and use
    --rankNormal-resid for trait residuals.
Found 200 samples in expression bed file ... 
Processed expression for 100 genes across 200 samples.
Scaling expression traits ... 
Estimating latent factors ... 
Done.