# Covariate data preprocessing

This notebook contains the workflow for processing covariate files and computing PCA-derived covariates from phenotype data.

## Methods overview

This workflow applies the covariate-related sections of the xQTL project pipeline.

## Data Input
- `output/protocol_example.protein.bed.gz`
- PCs from genotypes generated in the [genotype_pca](https://github.com/cumc/brain-xqtl-analysis/tree/main/analysis/Wang_Columbia/ROSMAP/pqtl/genotype_pca) step.
- Fixed covariate file including information such as sex, age at death, post-mortem interval (PMI), etc.


In [1]:
import pandas as pd
data_cov = pd.read_csv("protocol_example/protocol_example.samples.tsv", sep='\t')
data_cov.head(5)

Unnamed: 0,sample,age,sex,pmi
0,sample_384,88,1,9.0
1,sample_597,88,1,3.166667
2,sample_598,85,0,4.416667
3,sample_599,84,0,7.916667
4,sample_600,82,0,3.916667


## Data Output
- `output/`: contains all covariates, including genotype PCs, known covariates, and hidden factors.

### Merge covariates and genotype PCA

First, check how many genotype PCs we might want to include:

In [1]:
awk '$3 < 0.8' output/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.scree.txt| tail -1 | cut -f 1

awk: fatal: cannot open file `output/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.scree.txt' for reading (No such file or directory)


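In this example, 15 PCs explain 80% of the variation in the data, so we include 15 PCs. In practice, we suggest discussing the number of PCs with your collaborator and/or PI, given the results of the previous PCA step. The error recorded above indicates that the scree file was not found at that path when this cell was last run; point the command at the location where your genotype PCA step wrote its scree file.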
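
The `awk` one-liner above can also be mirrored in Python, which may be easier to adapt when exploring different thresholds. The following is a minimal sketch, not part of the pipeline: the column names and values in the toy scree table are made up for illustration, and it assumes the third column holds the cumulative proportion of variance explained, as the `awk` filter implies.

```python
import io
import pandas as pd

# Hypothetical scree table: column names and values are illustrative only.
# Assumed layout: PC index, proportion of variance, cumulative proportion.
toy_scree = io.StringIO(
    "PC\tprop\tcum_prop\n"
    "1\t0.40\t0.40\n"
    "2\t0.25\t0.65\n"
    "3\t0.12\t0.77\n"
    "4\t0.10\t0.87\n"
    "5\t0.08\t0.95\n"
    "6\t0.05\t1.00\n"
)

def last_pc_below(scree: pd.DataFrame, threshold: float = 0.8) -> int:
    """Return the PC index (first column) of the last row whose cumulative
    variance (third column) is below `threshold`; return 0 if no row qualifies."""
    below = scree[scree.iloc[:, 2] < threshold]
    if below.empty:
        return 0
    return int(below.iloc[-1, 0])

scree = pd.read_csv(toy_scree, sep="\t")
k = last_pc_below(scree, threshold=0.8)
print(k)
```

With a real scree file, replace `toy_scree` with the file path and check whether the file has a header row before relying on column positions.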

In [6]:
sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate \
    --pcaFile output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.rds \
    --covFile  protocol_example/protocol_example.samples.tsv \
    --tol_cov 0.4  \
    --k 15 \
    --container containers/bioinfo.sif

INFO: Running [32mmerge_genotype_pc[0m: 
INFO: [32mmerge_genotype_pc[0m is [32mcompleted[0m.
INFO: [32mmerge_genotype_pc[0m output:   [32m/home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz[0m
INFO: Workflow merge_genotype_pc (ID=wdba3531b2c9cee95) is executed successfully with 1 completed step.
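
To make the merge step concrete, here is a rough pandas sketch of the idea: join the fixed covariates to the first `k` genotype PCs by sample ID. This is not the pipeline's implementation. The covariate rows are taken from the table shown earlier, while the PC scores, the `PC1`..`PC3` column names, the helper `merge_covariates_with_pcs`, and the sample `sample_999` are hypothetical. The sketch does not model the `--tol_cov` option.

```python
import pandas as pd

# Known covariates (first rows of the sample table shown earlier).
cov = pd.DataFrame({
    "sample": ["sample_384", "sample_597", "sample_598"],
    "age": [88, 88, 85],
    "sex": [1, 1, 0],
    "pmi": [9.0, 3.166667, 4.416667],
})

# Made-up PC scores for illustration; real scores come from the genotype PCA step.
pcs = pd.DataFrame({
    "sample": ["sample_597", "sample_598", "sample_999"],
    "PC1": [0.01, -0.02, 0.03],
    "PC2": [0.10, 0.05, -0.07],
    "PC3": [-0.04, 0.02, 0.01],
})

def merge_covariates_with_pcs(cov: pd.DataFrame, pcs: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep the first k PC columns and inner-join them to the covariates by sample ID."""
    pc_cols = [c for c in pcs.columns if c != "sample"][:k]
    return cov.merge(pcs[["sample"] + pc_cols], on="sample", how="inner")

merged = merge_covariates_with_pcs(cov, pcs, k=2)
print(merged)
```

An inner join keeps only samples present in both tables, so a sample with covariates but no genotype PCs (or the reverse) is dropped from the merged result.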


### Compute residuals on merged covariates and perform hidden factor analysis
This step computes residuals of the phenotype data on the merged covariates (`Marchenko_PC_1`) and then performs hidden factor analysis on those residuals (`Marchenko_PC_2`).

In [11]:
sos run pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
   --cwd output/covariate \
   --phenoFile output/phenotype/protocol_example.protein.bed.gz  \
   --covFile output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz \
   --mean-impute-missing \
   --container containers/PCAtools.sif

INFO: Running [32mcomputing residual on merged covariates[0m: 
INFO: [32mcomputing residual on merged covariates[0m is [32mcompleted[0m.
INFO: [32mcomputing residual on merged covariates[0m output:   [32moutput/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.residual.bed.gz[0m
INFO: Running [32mMarchenko_PC_2[0m: 
INFO: [32mMarchenko_PC_2[0m is [32mcompleted[0m.
INFO: [32mMarchenko_PC_2[0m output:   [32moutput/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz[0m
INFO: Workflow Marchenko_PC (ID=w180bc4d94fbd6568) is executed successfully with 2 completed steps.
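
The two steps above can be illustrated with a small NumPy sketch on simulated data. It is not the pipeline's implementation and only shows the general idea: regress the phenotypes on the covariates by ordinary least squares, then count the principal components of the standardized residuals whose eigenvalues exceed the Marchenko-Pastur upper edge. The function names, the simulated dimensions, and the simplifying assumption of unit noise variance after standardization are all choices made for this example.

```python
import numpy as np

def residualize(Y: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Residuals of an OLS fit of Y (samples x features) on C (samples x covariates),
    with an intercept column added."""
    X = np.column_stack([np.ones(len(C)), C])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ beta

def n_pcs_marchenko_pastur(R: np.ndarray) -> int:
    """Count eigenvalues of the feature correlation matrix of R that exceed the
    Marchenko-Pastur upper edge (1 + sqrt(p / n))^2, assuming unit noise variance."""
    n, p = R.shape
    Z = (R - R.mean(axis=0)) / R.std(axis=0, ddof=1)
    eigvals = np.linalg.eigvalsh(Z.T @ Z / (n - 1))
    upper_edge = (1 + np.sqrt(p / n)) ** 2
    return int(np.sum(eigvals > upper_edge))

# Simulated data: 200 samples, 50 features, 2 known covariates, 3 hidden factors.
rng = np.random.default_rng(0)
n, p = 200, 50
C = rng.normal(size=(n, 2))
H = rng.normal(size=(n, 3))
Y = C @ rng.normal(size=(2, p)) + H @ rng.normal(size=(3, p)) + rng.normal(size=(n, p))

R = residualize(Y, C)
k_hidden = n_pcs_marchenko_pastur(R)
print(k_hidden)
```

Residualizing first removes variation already explained by the known covariates and genotype PCs, so the retained components capture only the remaining hidden structure.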
