# GWAS sumstat Processing

## Pre-requisites

We provide a container image `docker://gaow/twas` that contains all the software needed to run the pipeline. If you would rather configure the environment yourself, make sure the R `tidyverse` packages are installed.

# Input and Output
## Input

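- `--GWAS_sumstat` A tab-delimited GWAS summary statistics text file containing at least the chromosome, base-pair position, SNP name, effect size (bhat) and its standard error (sbhat). The chromosome and position columns must be named `CHR` and `BP`.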

- `--region_list` A text file with 4 columns specifying the chromosome (`#Chr`), start position (`P0`), end position (`P1`) and name of each region to analyze. The column names themselves do not matter, but the column order does, and the name of the first column must start with `#`. The region list can be generated with another SoS pipeline, `SOS_ROSMAP_gene_exp_processing.ipynb`.
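
As a rough illustration of the region list layout described above, the following Python sketch parses a small in-memory example. The region names and coordinates are made up for illustration, and `parse_region_list` is a hypothetical helper, not part of the pipeline. It mirrors the rule that a header line starting with `#` is skipped and that columns are identified by position only.

```python
def parse_region_list(text):
    """Parse region list text into (chrom, start, end, name) tuples.

    Lines that are blank or start with '#' (the header) are skipped.
    Columns are identified by position, not by header name.
    """
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        chrom, start, end, name = fields
        regions.append((chrom, int(start), int(end), name))
    return regions


# Hypothetical example content; a real file would list actual genes.
example = """#chr start end gene_id
1 1000000 1050000 GENE_A
2 2500000 2600000 GENE_B
"""

regions = parse_region_list(example)
print(regions)
```

Raising an error on a malformed line is a design choice of this sketch: it surfaces a wrong column count early instead of silently producing a mis-indexed region.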

## Output

- `uni_weight.rds` An RDS file, written under `{wd}/gwas_sumstat/` for each region, that holds the `bhat` and `sbhat` matrices and serves as the input for the mixture pipeline.
 

# Command interface 

In [1]:
!sos run GWAS_Gene_annotation.ipynb -h

usage: sos run GWAS_Gene_annotation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  complete_data_analysis
  hsq
  susie
  susie_cv
  susie_get_info

Global Workflow Options:
  --analysis-units VAL (as path, required)
                        single column file each line is the data filename
  --data-dir VAL (as path, required)
                        Path to data directory
  --data-suffix VAL (as str, required)
                        data file suffix
  --wd output (as path)
                        Path to work directory where output locates
  --prior . (as path)
                        Path to prior data file: an RDS file with `U` and `w`
                        

# Working example
The minimal working example (MWE) file is available at:
"https://www.synapse.org/#!Synapse:syn24179064/files/"

Running this MWE should take around 2 minutes. When applying the following command to gene expression files other than this MWE, pay close attention to the gene_start and gene_end positions. If too few or too many genes pass the heritability check, consider increasing or decreasing the `--window` option, respectively.
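
To make the effect of `--window` concrete, the sketch below computes the scanning interval for one gene. `scan_interval` is a hypothetical helper and the coordinates are invented. Clamping the lower bound at zero is an assumption of this sketch rather than something the pipeline is documented to do.

```python
def scan_interval(gene_start, gene_end, window=500000):
    """Return the (lower, upper) base-pair bounds scanned around a gene."""
    if gene_end < gene_start:
        raise ValueError("gene_end must not be smaller than gene_start")
    lower = max(0, gene_start - window)  # assumption: never go below position 0
    upper = gene_end + window
    return lower, upper


# Invented coordinates for illustration only.
print(scan_interval(1_000_000, 1_050_000))           # default 500 kb radius
print(scan_interval(1_000_000, 1_050_000, 100_000))  # narrower radius
print(scan_interval(200_000, 250_000))               # gene close to the chromosome start
```

A larger radius widens the interval on both sides and so admits more SNPs per gene, which is why it is the natural knob to turn when too few genes pass downstream checks.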

In [1]:
## Test pipeline with test data
## Switch back to absolute paths, otherwise there will be a file-not-found error in step 5
nohup sos run ~/GIT/neuro-twas/workflow/GWAS_Gene_annotation.ipynb \
  --wd "/mnt/mfs/statgen/neuro-twas/TWAS_sumstat/" \
  --GWAS_sumstat "/mnt/mfs/statgen/neuro-twas/TWAS_sumstat/Data/GCST90012877_buildGRCh37_colrenamed.txt" \
  --region_list  "/mnt/mfs/statgen/neuro-twas/mv_wg/wg_rds_list_final" \
  --window 500000 \
  --container /mnt/mfs/statgen/containers/twas_latest.sif &


ERROR: Failed to locate twas_fusion.ipynb.sos



# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [5]:
[global]
# An index text file with 4 columns specifying the chr, start, end and names of regions to analyze
parameter: region_list = path
# Path to the work directory of the weight computation: output weights and cache will be saved to this directory.
parameter: wd = path('./')
# Specify the scanning window: the up- and downstream radius to analyze around the region of interest, in base pairs
parameter: window = 500000
# Prefix for the output file
parameter: prefix = "geneTpmResidualsAgeGenderAdj_rename"

# Specify the number of jobs per run.
parameter: job_size = 2

# Container option for software to run the analysis: docker or singularity
parameter: container = 'gaow/twas'

# Get regions of interest to focus on.
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]

## Partition of the GWAS sumstat for each gene
This step extracts, for each gene, the GWAS summary statistics of the SNPs that fall within the scanning window around the gene and saves them in the format needed by the follow-up analysis.
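
The filtering logic of this step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline's R code: the rows are invented, `filter_sumstat` is a hypothetical helper, and the dictionary keys `beta`, `se` and `snp` stand in for the columns selected by `beta_col`, `se_beta_col` and `snp_col`.

```python
import math


def filter_sumstat(rows, chrom, start, end, window=500000):
    """Keep SNPs near a region whose Z-score (beta / se) is finite.

    Returns two dicts keyed by SNP name, mirroring the `bhat` and `sbhat`
    entries saved by the step.
    """
    bhat, sbhat = {}, {}
    for row in rows:
        if row["CHR"] != chrom:
            continue
        if not (start - window <= row["BP"] <= end + window):
            continue
        beta, se = row["beta"], row["se"]
        # Z is undefined when se is zero; treat it as non-finite.
        z = beta / se if se != 0 else math.nan
        if not math.isfinite(z):
            continue
        bhat[row["snp"]] = beta
        sbhat[row["snp"]] = se
    return bhat, sbhat


# Invented rows for illustration only.
rows = [
    {"CHR": 1, "BP": 900_000, "snp": "rs1", "beta": 0.10, "se": 0.05},
    {"CHR": 1, "BP": 1_200_000, "snp": "rs2", "beta": float("nan"), "se": 0.04},
    {"CHR": 1, "BP": 5_000_000, "snp": "rs3", "beta": 0.20, "se": 0.05},
    {"CHR": 2, "BP": 1_000_000, "snp": "rs4", "beta": 0.30, "se": 0.10},
    {"CHR": 1, "BP": 1_040_000, "snp": "rs5", "beta": -0.02, "se": 0.0},
]

bhat, sbhat = filter_sumstat(rows, chrom=1, start=1_000_000, end=1_050_000)
print(bhat)
print(sbhat)
```

In this toy input only `rs1` survives: `rs2` has a missing effect size, `rs3` lies outside the window, `rs4` is on another chromosome, and `rs5` has a zero standard error.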

In [None]:
[GWAS_Annotation]
# GWAS summary statistics file; the chromosome and position columns must be named CHR and BP
parameter: GWAS_sumstat = path
# Column numbers (1-based) of beta, SE of beta, and SNP names
parameter: beta_col = 5
parameter: se_beta_col = 8
parameter: snp_col = 9
input: GWAS_sumstat, for_each = "regions"
output: f'{wd}/gwas_sumstat/{prefix}.{_regions[0]}.uni_weight.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '10G', tags = f'{step_name}_{_output:bn}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("modelr")
    library("purrr")
    library("tidyr")
    sumstat = read_delim("$[_input]","\t")
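    # Z-score per SNP; [[ ]] extracts each column as a plain vector
    sumstat_ftr = sumstat%>%mutate(Z = sumstat[[$[beta_col]]]/sumstat[[$[se_beta_col]]])%>%
    # Keep SNPs on the region's chromosome within `window` bp of the region;
    # _regions[1], _regions[2] and _regions[3] are used here as chr, start and end
    filter(CHR == $[_regions[1]], BP >= $[_regions[2]] - $[window], BP <= $[_regions[3]] + $[window])%>%
    ## remove all the NA, NaN and Inf sumstat
    filter(is.finite(Z))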
    output = list()
    output$bhat = as.matrix(sumstat_ftr[,$[beta_col]])
    rownames(output$bhat) = sumstat_ftr[,$[snp_col]]%>%unlist%>%as.character
    output$sbhat = as.matrix(sumstat_ftr[,$[se_beta_col]])
    rownames(output$sbhat) = sumstat_ftr[,$[snp_col]]%>%unlist%>%as.character
    output%>%saveRDS("$[_output]")