# Apex Prototyping
This notebook documents a prototype workflow for using Apex to analyze a list of genes.

## Pre-requisites



# Input and Output
## Input

- `--genotype_list`: An index text file with two columns: the chromosome and the corresponding PLINK bed file.
- `--molecular-pheno`: The text file containing the molecular phenotype table, with regions (genes) as rows and samples as columns.
- `--region_list`: The text file with 4 columns specifying the #Chr, P0 (start position), P1 (end position) and the names of the regions to analyze. The column names do not matter, but the column order does, and the name of the first column must start with a `#`. The region_list can be generated with another SoS pipeline, SOS_ROSMAP_gene_exp_processing.ipynb.
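
The `region_list` layout described above can be illustrated with a short Python sketch. The file contents, gene names, and the helper name `read_regions` are made up for illustration; only the four-column order and the leading `#` on the header line come from the description.

```python
import io

# Hypothetical region list: the header starts with '#';
# the columns are chromosome, start, end, and region name.
EXAMPLE = """#chr\tstart\tend\tgene_ID
1\t1000\t5000\tGENE_A
1\t8000\t9000\tGENE_B
2\t200\t700\tGENE_C
"""

def read_regions(handle):
    """Return (chrom, start, end, name) tuples, skipping the header and blank lines."""
    regions = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end, name = line.split()[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions

regions = read_regions(io.StringIO(EXAMPLE))
# Unique chromosomes, in order of first appearance
chroms = list(dict.fromkeys(r[0] for r in regions))
print(regions)
print(chroms)
```

Because the header is identified only by its leading `#`, the column names themselves are never read, which is why their wording does not matter.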

## Output

- `uni_weight.RDS`: an RDS file that serves as the input for the mixture pipeline.
- `susie.RData`: an R object containing the susie output for each region.
 

# Command interface 

# Working example


In [1]:
## Test the pipeline with test data
## Switch back to an absolute path; otherwise there will be a file-not-found error in step 5



ERROR: Failed to locate twas_fusion.ipynb.sos



# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [5]:
[global]
# Path to the input molecular phenotype data.
parameter: molecular_pheno = path
# Covariate file, in similar format as the molecular_pheno
parameter: cov = path
# Genotype file in vcf/vcf.gz/bcf format
parameter: genotype = path
# An index text file with 4 columns specifying the chr, start, end and names of regions to analyze
parameter: region_list = path
# Path to the work directory of the weight computation: output weights and cache will be saved to this directory.
parameter: wd = path('./')
# Path to store the output folder
parameter: output_path = f'{wd:a}/result'
# Specify the number of jobs per run.
parameter: job_size = 2
# Container option for software to run the analysis: docker or singularity
parameter: container = 'gaow/twas'
# Prefix for the analysis output
parameter: Prefix = ""

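# Get a list of all regions (the header line starts with '#' and is skipped)
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
# Get a sorted list of all chromosomes (first column of every region, not the fields of the first region)
chrm = sorted(set(x[0] for x in regions))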

## Filter for the genes to be analyzed
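
The filtering in this step amounts to an inner join between the phenotype table and the region list on a shared gene identifier column. A minimal Python sketch of the same operation follows, using made-up rows and assuming that both tables carry a `gene_ID` column, as the R code of this step does. The helper names are hypothetical, and unlike a full relational join the sketch assumes each gene appears at most once in the region list.

```python
import csv
import io

# Made-up tab-delimited tables; only the shared 'gene_ID' column is taken from the step.
PHENO = "gene_ID\tsample1\tsample2\nGENE_A\t1.2\t0.7\nGENE_B\t3.4\t2.9\nGENE_D\t0.1\t0.3\n"
REGION = "#chr\tstart\tend\tgene_ID\n1\t1000\t5000\tGENE_A\n2\t200\t700\tGENE_C\n1\t8000\t9000\tGENE_B\n"

def read_table(text):
    """Parse a tab-delimited table into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def inner_join(left, right, key):
    """Keep rows of `left` whose key also occurs in `right`, merging the columns."""
    index = {row[key]: row for row in right}  # assumes unique keys in `right`
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

filtered = inner_join(read_table(PHENO), read_table(REGION), "gene_ID")
for row in filtered:
    print(row["gene_ID"], row["#chr"], row["start"], row["end"], row["sample1"], row["sample2"])
```

Genes present in only one of the two tables (`GENE_C` and `GENE_D` here) are dropped, which is the intended filtering effect.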

In [1]:
[Expression_Filtering]
input: molecular_pheno,region_list
# Name the output after the phenotype file only; _input holds two files here
output: f'{wd:a}/cache/{_input[0]:bn}.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("modelr")
    library("purrr")
    pheno = read_delim("$[molecular_pheno]",delim = "\t")
    region = read_delim("$[region_list]",delim = "\t")
    output = inner_join(pheno, region, by = "gene_ID")
    output%>%write_delim("$[_output[0]]",delim = "\t")
    

## Factor Analysis
This step infers hidden covariates and produces residual expression for downstream analysis.
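
To make the idea of hidden covariates and residual expression concrete, here is a toy Python sketch of the PCA case (`iteration = 0`). It is not how Apex implements factor analysis: it only extracts one principal component by power iteration from a made-up expression matrix and projects it out. All function names and numbers are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center_rows(matrix):
    """Subtract each gene's mean across samples."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered

def first_factor(matrix, n_iter=100):
    """Leading sample-level principal component (unit vector) via power iteration."""
    n = len(matrix[0])
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        # Apply X^T X to v without forming the matrix: sum over genes of (x_g . v) x_g
        scores = [dot(row, v) for row in matrix]
        w = [sum(s * row[j] for s, row in zip(scores, matrix)) for j in range(n)]
        norm = math.sqrt(dot(w, w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def remove_factor(matrix, factor):
    """Project the factor out of every gene's expression vector."""
    residuals = []
    for row in matrix:
        score = dot(row, factor)
        residuals.append([x - score * f for x, f in zip(row, factor)])
    return residuals

# Toy data: three genes driven entirely by one hidden factor across five samples
hidden = [-2.0, -1.0, 0.0, 1.0, 2.0]
expression = [
    [10 + 1.0 * h for h in hidden],
    [5 - 0.5 * h for h in hidden],
    [7 + 2.0 * h for h in hidden],
]

centered = center_rows(expression)
factor = first_factor(centered)
residual = remove_factor(centered, factor)
print([round(f, 3) for f in factor])
print(max(abs(x) for row in residual for x in row) < 1e-9)
```

In this toy example the residuals vanish because the data contain nothing but the hidden factor; with real expression data the residuals would retain the variation not explained by the inferred factors.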

In [225]:
[cis_qtl_1, Factor_Analysis]
output: f'{wd:a}/cache/factor/{_input:bn}.bed'

# Number of latent common factors
parameter: n_of_factor = 5
# Number of factor analysis iterations (0 for PCA).                                         
parameter: iteration = 0
# Factor analysis prior
parameter: p = 0.05
parameter: tau = 0.05
# Parameter for running
parameter: thread = 3
        
        
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, volumes = [f'{genotype:ad}:{genotype:ad}']
    # Work in the directory that holds this step's declared output
    mkdir -p $[wd:a]/cache/factor
    cd $[wd:a]/cache/factor
    # Infer hidden factors from the expression data
    Apex factor --vcf $[genotype:an] \
    --bed $[_input] \
    --cov $[cov] \
    --out $[Prefix] \
    --threads $[thread]
    

## LMM Regression  
This step precomputes and stores a) LMM null models and trait residuals and b) spline terms for LMM genotypic variances, to speed up downstream analysis.

In [None]:
[cis_qtl_2,LMM]
output: f'{wd:a}/cache/lmm/{_input:bn}.bed'
# Window size passed to Apex
parameter: window = 100000
# Number of threads
parameter: thread = 3
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, volumes = [f'{genotype:ad}:{genotype:ad}']
    # Work in the directory that holds this step's declared output
    mkdir -p $[wd:a]/cache/lmm
    cd $[wd:a]/cache/lmm
    Apex lmm --vcf $[genotype:an] \
    --bed $[_input] \
    --cov $[cov] \
    --out $[Prefix] \
    --threads $[thread] \
    --fit-null \
    --save-resid \
    --write-gvar \
    --window=$[window]
    

## QTL Sumstat generation  
This step generates the cis-QTL summary statistics and the vcov (covariate-adjusted LD) files, so that downstream analysis can run from summary statistics.
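
The `window` parameter of this step limits which variants are considered for each region. As a rough illustration only, and assuming that the window extends a fixed number of base pairs on either side of the region's start and end (the Apex documentation should be checked for the exact definition it uses), the selection could be sketched in Python as follows. The variant positions, identifiers, and helper names are invented.

```python
def cis_interval(start, end, window):
    """Interval covering the region plus `window` bp on each side, clipped at 0."""
    return max(0, start - window), end + window

def variants_in_cis(variants, region, window):
    """variants: iterable of (chrom, pos, id); region: (chrom, start, end, name)."""
    chrom, start, end, _name = region
    lo, hi = cis_interval(start, end, window)
    return [v for v in variants if v[0] == chrom and lo <= v[1] <= hi]

# Hypothetical region and variants
region = ("1", 150000, 160000, "GENE_A")
variants = [
    ("1", 40000, "rs_a"),
    ("1", 60000, "rs_b"),
    ("1", 255000, "rs_c"),
    ("1", 270000, "rs_d"),
    ("2", 155000, "rs_e"),
]

selected = variants_in_cis(variants, region, window=100000)
print([v[2] for v in selected])
```

Here only the two variants on chromosome 1 between positions 50000 and 260000 are kept; the variant on chromosome 2 is excluded regardless of its position.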

In [None]:
[cis_qtl_3,cis_sumstat]
input: for_each = 'chrm'
output: f'{wd:a}/cache/sumstat/{_input:bn}.bed'
# Window size passed to Apex
parameter: window = 100000
# Number of threads
parameter: thread = 3
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, volumes = [f'{genotype:ad}:{genotype:ad}']
    # Work in the directory that holds this step's declared output
    mkdir -p $[wd:a]/cache/sumstat
    cd $[wd:a]/cache/sumstat
    Apex cis --vcf $[genotype:an] \
    --bed $[_input] \
    --cov $[cov] \
    --out $[Prefix] \
    --threads $[thread] \
    --fit-null \
    --save-resid \
    --write-gvar \
    --window=$[window]

## Conducting meta-analysis for multi-variant (TBD)
Taking the input from 