# Covariate Data Preprocessing

This notebook contains the workflow for processing covariate files and computing PCA-derived covariates from phenotype data.

#### Miniprotocol Timing
This represents the total duration for all miniprotocol phases. While module-specific timings are provided separately on their respective pages, they are also included in this overall estimate. 

Timing < X minutes

## Overview

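This workflow applies the covariate-related sections of the xQTL project pipeline:

1. `covariate_formatting.ipynb` (`merge_genotype_pc`): merges the known covariates with the genotype PCs.
2. `covariate_hidden_factor.ipynb` (`Marchenko_PC`): computes residuals on the merged covariates and performs hidden factor analysis.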


## Data Input
- `output/protocol_example.protein.bed.gz`
- PCs from genotypes generated in the [genotype_pca](https://github.com/cumc/brain-xqtl-analysis/tree/main/analysis/Wang_Columbia/ROSMAP/pqtl/genotype_pca) step.
- Fixed covariate file including information such as sex, age at death, post-mortem interval (PMI), etc.

## Data Output
- `output/`: contains all covariates, combining genotype PCs, known covariates, and hidden factors.

## Steps

### i. Merge Covariates and Genotype PCA
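The `--k` parameter sets how many genotype PCs to keep, and therefore how much of the total variation the retained PCs explain. In this example we use a threshold of 80%: the command reads the scree file and passes to `--k` the index of the last PC whose value in the third column (taken here to be the cumulative proportion of variance explained) is below 0.8.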

In [None]:
sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate \
    --pcaFile output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.rds \
    --covFile input/protocol_example.samples.tsv \
    --tol_cov 0.4 \
    --k $(awk '$3 < 0.8' output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1) \
    --container containers/bioinfo.sif
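
To see what the `awk | tail | cut` pipeline passed to `--k` does, here is a minimal sketch on a toy scree table. The values are made up for illustration, and the column layout (PC index, proportion of variance explained, cumulative proportion, tab-separated) is an assumption about the real `.scree.txt` file rather than something confirmed by the pipeline documentation.

```shell
# Toy scree table (hypothetical values), tab-separated:
# column 1 = PC index, column 2 = variance explained, column 3 = cumulative variance explained
scree=$(mktemp)
printf '1\t0.40\t0.40\n2\t0.25\t0.65\n3\t0.10\t0.75\n4\t0.08\t0.83\n5\t0.05\t0.88\n' > "$scree"

# Keep rows whose cumulative value is below 0.8, take the last such row,
# and extract its PC index (first tab-separated field).
k=$(awk '$3 < 0.8' "$scree" | tail -1 | cut -f 1)
echo "k = $k"

rm -f "$scree"
```

With these toy values the pipeline prints `k = 3`. The selected PCs explain just under the threshold (75% here); the next PC is the one that would cross 80%. To use a different threshold, change `0.8` in the `awk` expression.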

### ii. Compute residuals on merged covariates and perform hidden factor analysis

In [None]:
sos run pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
    --cwd output/covariate \
    --phenoFile output/phenotype/protocol_example.protein.bed.gz \
    --covFile output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz \
    --mean-impute-missing \
    --container containers/PCAtools.sif
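
Step ii depends on the merged covariate file written by step i and on the phenotype file. The following is a small, optional sketch of a pre-flight check; `require_files` is a hypothetical helper written for this example (it is not part of the xQTL pipeline), and the two paths are copied from the command above.

```shell
# Hypothetical helper: report each missing or empty file on stderr and
# return non-zero if at least one is absent.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Paths taken from the step ii command.
if require_files \
    output/phenotype/protocol_example.protein.bed.gz \
    output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz
then
    echo "inputs found; step ii can be run"
else
    echo "inputs missing; run step i and the phenotype preprocessing first"
fi
```

The check only tests that each file exists and is non-empty; it does not validate the file contents or sample overlap.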

## Anticipated Results