# Phenotype data preprocessing
This notebook describes the data preprocessing pipeline for the xQTL workflow. It generates:
1. Factors from expression data
2. PCA from genotype data
3. GRM from genotype data
4. LD from genotype data, filtered by GRM [TBD]
5. Molecular phenotype files per chromosome within the selected regions, in the formats that APEX and tensorQTL accept


**FIXME: Hao, I am thinking this kind of notebook (one that sits outside these folders) should be of a tutorial nature. It should only contain `sos run` commands, run interactively, with enough text explanation. Those who want to run the default analysis should work with `master_control.ipynb` and generate the commands to run as is. Those who want to customize the analysis should refer to each of these "recipes" and change the parameters here. That should cover 95% of use cases. People will read the module notebooks only for learning purposes. Those who want to edit the module notebooks we will consider developers, or at least power users, and I expect few of them.**


### Input
The input for this workflow is one row of the input recipe file, which documents the paths to:
1. One complete molecular phenotype data file
2. One collection of genotype data in PLINK format, partitioned by chromosome
3. One file listing the regions to be analyzed


### Output
For each collection, the output is 23 sets of:
1. EXP file for the selected regions
2. Genotype from the VCF file

and 1 set of:
1. PCA + Factor + Covariate file

### Executable
This notebook depends on scripts from several other notebooks; the directory that contains them is specified by `exe_dir`.

In [2]:
[global]
import os
# Working directory, also used as the output directory
parameter: wd = path
# Container image used to run the analysis
parameter: container = 'gaow/twas'
# Name (file name prefix) for the analysis output
parameter: name = str
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "24h"
# Memory expected
parameter: mem = "60G"
# Number of threads
parameter: numThreads = 20
# Directory containing the executable notebooks
parameter: exe_dir = path("~/GIT/ADSPFG-xQTL/workflow")
# yml template
parameter: yml = '/home/hs3163/GIT/ADSPFG-xQTL/code/csg.yml'
# queue for analysis
parameter: queue = "csg"
# Number of jobs submitted concurrently (passed to `sos run -J`)
parameter: J = 200
# Factor analysis option
parameter: factor_option = "APEX"

## Temp   
parameter: container_lmm = str
parameter: container_apex = str

# File listing the regions to be analyzed
parameter: region_list = path
# Read the region list, skipping blank lines and comment lines
with open(region_list) as region_file:
    regions = [x.strip().split() for x in region_file if x.strip() and not x.strip().startswith('#')]
# Get the unique chromosomes that have regions to be analyzed (first column of each region).
def extract(lst):
    return [item[0] for item in lst]
chrom = sorted(set(extract(regions)))
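
The region-list parsing in the `[global]` section can be tried outside SoS with the minimal sketch below. The only thing taken from the workflow is that the first whitespace-separated field of each non-comment line is the chromosome; the remaining columns (`start`, `end`, `gene_id`) and the helper names `parse_region_list` and `unique_chromosomes` are hypothetical and used for illustration only.

```python
import io


def parse_region_list(handle):
    """Return one list of fields per non-empty, non-comment line."""
    regions = []
    for line in handle:
        line = line.strip()
        if line and not line.startswith("#"):
            regions.append(line.split())
    return regions


def unique_chromosomes(regions):
    """Return the sorted unique values of the first field (the chromosome)."""
    return sorted({fields[0] for fields in regions})


# Hypothetical region list; only the first column matters for `chrom`.
example = io.StringIO(
    "#chr start end gene_id\n"
    "1 1000 2000 geneA\n"
    "\n"
    "22 500 900 geneB\n"
    "1 3000 4000 geneC\n"
)
regions = parse_region_list(example)
chrom = unique_chromosomes(regions)
print(chrom)
```

Because the chromosome labels are strings, the sort is lexicographic, so a label such as `'10'` sorts before `'2'`. This affects only the order of the per-chromosome outputs, not which chromosomes are included.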

## Processing of the molecular phenotype file
This workflow produces a bed.gz+tabix file containing all molecular phenotype data included in the region list, to feed into the APEX factor analysis.

This workflow also produces a bed.gz+tabix file for each chromosome for downstream QTL association analysis (bed.gz+tabix for APEX and bed.gz for tensorQTL).
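
To make the per-chromosome output concrete, here is a minimal sketch of splitting a BED-like table into one compressed file per chromosome. It is not the code in `Region_extraction.ipynb`: the helper `split_bed_by_chrom`, the column layout, and the sample values are assumptions for illustration. Only the output naming pattern `{name}.chr{x}.mol_phe.bed.gz` comes from this workflow. The sketch uses Python's `gzip` module for simplicity; a tabix index requires bgzip (BGZF) compression, which is not shown here.

```python
import gzip
import os
import tempfile


def split_bed_by_chrom(bed_lines, out_dir, name, chroms):
    """Write one gzip-compressed file per chromosome; return {chrom: path}."""
    header = [line for line in bed_lines if line.startswith("#")]
    paths = {}
    for c in chroms:
        path = os.path.join(out_dir, f"{name}.chr{c}.mol_phe.bed.gz")
        # Keep data rows whose first tab-separated field matches the chromosome.
        rows = [
            line
            for line in bed_lines
            if not line.startswith("#") and line.split("\t")[0] == c
        ]
        with gzip.open(path, "wt") as out:
            out.writelines(header + rows)
        paths[c] = path
    return paths


# Hypothetical phenotype table: 4 BED-style columns followed by 2 samples.
bed_lines = [
    "#chr\tstart\tend\tgene_id\tsample1\tsample2\n",
    "1\t1000\t1001\tgeneA\t0.5\t-0.2\n",
    "22\t500\t501\tgeneB\t1.1\t0.3\n",
    "1\t3000\t3001\tgeneC\t-0.7\t0.9\n",
]
out_dir = tempfile.mkdtemp()
paths = split_bed_by_chrom(bed_lines, out_dir, "demo", ["1", "22"])

# Read back the chromosome 1 file: header plus two data rows.
with gzip.open(paths["1"], "rt") as fh:
    chr1_rows = fh.read().splitlines()
print(chr1_rows)
```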

In [None]:
[Region_extraction_1]
# Path to the input molecular phenotype data.
parameter: molecular_pheno_whole = path

input: molecular_pheno_whole, region_list
output: molecular_pheno_whole_bed = f'{wd}/Phenotype/{name}.mol_phe.bed.gz',
        molecular_pheno_chr = [f'{wd}/Phenotype/{name}.chr{x}.mol_phe.bed.gz' for x in chrom]
task: trunk_workers = 1, trunk_size = 1, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
        sos run $[exe_dir]/Data_Processing/Phenotype/Region_extraction.ipynb region_extraction \
            --wd $[wd]/Phenotype/ \
            --container $[container_apex] \
            --name $[name] \
            --numThreads $[numThreads] \
            --molecular_pheno_whole $[molecular_pheno_whole] \
            --region_list $[region_list] \
            -J $[J] -q $[queue] -c $[yml]

In [None]:
[Region_extraction_2]
input: named_output("molecular_pheno_chr")
output: molecular_pheno_chr_list = f'{wd}/Phenotype/{name}.mol_phe.chr_list'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import pandas as pd
    # Write one per-chromosome phenotype file path per line, without header or index
    df = pd.DataFrame({"molecular_pheno_chr": [$[_input:r,]]})
    df.to_csv("$[_output]", sep = "\t", header = False, index = False)
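
The resulting `.mol_phe.chr_list` file is a single-column text file with one per-chromosome phenotype path per line. The sketch below reproduces that layout with the standard library only and reads it back, as a downstream step might. The helper names `write_chr_list` and `read_chr_list` and the example paths are hypothetical.

```python
import os
import tempfile


def write_chr_list(paths, list_path):
    """Write one path per line, with no header and no index column."""
    with open(list_path, "w") as out:
        for p in paths:
            out.write(f"{p}\n")


def read_chr_list(list_path):
    """Return the non-empty lines of the list file as paths."""
    with open(list_path) as fh:
        return [line.strip() for line in fh if line.strip()]


# Hypothetical per-chromosome outputs following the workflow's naming pattern.
per_chrom = [f"Phenotype/demo.chr{x}.mol_phe.bed.gz" for x in ["1", "22"]]
list_path = os.path.join(tempfile.mkdtemp(), "demo.mol_phe.chr_list")
write_chr_list(per_chrom, list_path)
recovered = read_chr_list(list_path)
print(recovered)
```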