# Covariate data preprocessing

This module documents the output of the factor analysis section of the command generator for the MWE and explains the purpose of each command. The files used on this page can be found [here](https://drive.google.com/drive/folders/16ZUsciZHqCeeEWwZQR46Hvh5OtS8lFtA?usp=sharing).

**Each command in the factor analysis tutorials is generated once per theme. The MWE is treated as a single-theme analysis.**



## Merge covariate
This step generates a concatenated principal component + covariate (pc-cov) matrix, which is then used to generate the residual phenotype, as outlined in the [phenotype processing]() page.

The tolerance for the sample-wide NA rate of any covariate or PC is specified by `tol_cov`. A value of `-1` means quit. Otherwise, covariates with a missing rate larger than `tol_cov` are removed, and covariates with a missing rate smaller than `tol_cov` are mean-imputed.
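The `tol_cov` rule can be illustrated with a minimal sketch on a made-up table. This is not the pipeline's implementation: the layout (one covariate per row, one sample per column, `NA` for missing values), the file contents, and the treatment of a missing rate exactly equal to `tol_cov` (imputed here) are all assumptions, and the `-1` case is not modeled.

```shell
# Toy covariate table (hypothetical layout): one covariate per row, one sample per column.
cov=$(mktemp)
printf 'id\ts1\ts2\ts3\ts4\n'   > "$cov"
printf 'age\t70\tNA\t80\t90\n' >> "$cov"
printf 'pmi\tNA\tNA\t5\t7\n'   >> "$cov"
tol_cov=0.3

filtered=$(awk -F'\t' -v OFS='\t' -v tol="$tol_cov" '
  NR == 1 { print; next }                      # header passes through unchanged
  {
    n = NF - 1; miss = 0; sum = 0
    for (i = 2; i <= NF; i++) {
      if ($i == "NA") miss++; else sum += $i
    }
    if (miss / n > tol) next                   # missing rate above tolerance: drop the row
    if (miss > 0) {                            # otherwise fill NAs with the row mean
      mean = sum / (n - miss)
      for (i = 2; i <= NF; i++) if ($i == "NA") $i = mean
    }
    print
  }' "$cov")

echo "$filtered"
rm -f "$cov"
```

In this toy input, `age` has a missing rate of 0.25 and is kept with its `NA` replaced by the mean of the observed values, while `pmi` has a missing rate of 0.5 and is dropped.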

The first k PCs, with k chosen from the scree file using a 70% PVE threshold, are merged with the covariates for downstream analysis and for estimating the residual expression in the `phenotype_preprocessing` section.

In [None]:
sos run pipeline/covariate_formatting.ipynb merge_pca_covariate \
        --cwd output/data_preprocessing/MWE/covariates \
        --pcaFile data_preprocessing/MWE/pca/MWE.MWE.related.filtered.extracted.pca.projected.rds \
        --covFile  MWE.covariate.cov.gz \
        --tol_cov 0.3  \
        --k `awk '$3 < 0.7' output/data_preprocessing/MWE/pca/MWE.MWE.related.filtered.extracted.pca.projected.scree.txt | tail -1 | cut -f 1 ` \
        --container containers/bioinfo.sif
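
The `--k` argument in the command above is filled in by a command substitution over the scree file. The sketch below runs the same `awk | tail | cut` chain on a made-up table to show which row it selects. The table contents and its layout (tab-separated, no header, PC index in column 1, cumulative PVE in column 3) are assumptions for illustration, not the confirmed format of the real `scree.txt`.

```shell
# Hypothetical scree table: PC index, PVE, cumulative PVE (tab-separated, no header).
scree=$(mktemp)
printf '1\t0.40\t0.40\n2\t0.20\t0.60\n3\t0.08\t0.68\n4\t0.05\t0.73\n5\t0.03\t0.76\n' > "$scree"

# Keep rows whose third column is below 0.7, take the last such row, print its first column.
k=$(awk '$3 < 0.7' "$scree" | tail -1 | cut -f 1)
echo "k = $k"
rm -f "$scree"
```

On this toy input the chain returns the last PC whose third column is still below 0.7, so the selected PCs stop just short of the threshold rather than crossing it. If the intent is to reach at least 70% PVE, that is worth checking against the real scree file.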

## Factor Analysis
The [residual expression]() is used for factor analysis with either [BiCV (APEX)](https://corbinq.github.io/apex/) or [PEER (MOFA2)](https://biofam.github.io/MOFA2/). The purpose of factor analysis is to uncover unmeasured factors embedded in the phenotype data. Potential factors include experimental batch effects, unmeasured morbidity of the samples, etc.

PEER is the primary factor analysis method used in our analysis.

In [None]:
sos run pipeline/BiCV_factor.ipynb BiCV \
   --cwd output/data_preprocessing/MWE/covariates \
   --phenoFile data_preprocessing/MWE/phenotype/MWE.log2cpm.MWE.covariate.cov.MWE.MWE.related.filtered.extracted.pca.projected.resid.bed.gz  \
   --container containers/APEX.sif  \
   --walltime 24h \
   --numThreads 8 \
   --iteration 1000 \
   --N 10

In [None]:
sos run pipeline/PEER_factor.ipynb PEER \
   --cwd output/data_preprocessing/MWE/covariates \
   --phenoFile data_preprocessing/MWE/phenotype/MWE.log2cpm.MWE.covariate.cov.MWE.MWE.related.filtered.extracted.pca.projected.resid.bed.gz  \
   --container containers/PEER.sif  \
   --walltime 24h \
   --numThreads 8 \
   --iteration 1000 \
   --N 10

## Merge factors and covariates
The factors estimated above are concatenated with the pc-cov matrix for downstream analysis.

In [None]:
sos run pipeline/covariate_formatting.ipynb merge_factor_covariate \
        --cwd output/data_preprocessing/MWE/covariates \
        --factorFile data_preprocessing/MWE/covariates/MWE.log2cpm.MWE.covariate.cov.MWE.MWE.related.filtered.extracted.pca.projected.resid.bed.PEER.cov.gz \
        --covFile  data_preprocessing/MWE/covariates/MWE.covariate.cov.MWE.MWE.related.filtered.extracted.pca.projected.gz \
        --container containers/bioinfo.sif
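
As a rough illustration of what concatenating factors with the pc-cov matrix amounts to, the sketch below stacks the rows of two made-up tables after checking that their sample columns agree. The layout (a header row of sample IDs, then one covariate or factor per row) and all names and values are assumptions; the actual `merge_factor_covariate` step may align samples differently.

```shell
# Toy inputs (hypothetical layout): header row of sample IDs, then one covariate or factor per row.
covs=$(mktemp); factors=$(mktemp)
printf 'id\ts1\ts2\nage\t70\t80\nPC1\t0.1\t-0.2\n' > "$covs"
printf 'id\ts1\ts2\nfactor_1\t0.5\t-0.5\n' > "$factors"

# Only stack the rows if both tables list the same samples in the same order.
if [ "$(head -1 "$covs")" = "$(head -1 "$factors")" ]; then
  merged=$(cat "$covs"; tail -n +2 "$factors")   # keep one header, append factor rows
else
  echo "sample columns differ; align them before merging" >&2
  merged=""
fi

echo "$merged"
rm -f "$covs" "$factors"
```

The header check matters because stacking rows from tables with different sample orders would silently assign factor values to the wrong samples.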