# Susie whole genome sequencing

This notebook is a modified version of the twas_fusion_susie.ipynb workflow. It conducts univariate SuSiE and univariate regression, and produces the summary statistics needed to compute the whole-genome prior.

## Aim

The aim of this pipeline is to create the input for the mixture prior.

## Overview

__Objective__:
    To compute the association between gene expression and SNPs for TWAS analysis.

__Background__:
    SNPs can modulate functional phenotypes both directly and by modulating the expression levels of genes.
Therefore, integrating expression measurements with larger-scale GWAS summary association statistics helps identify the genes associated with the targeted complex traits.

__Significance__:
    By applying this method, new candidate genes whose expression level is significantly associated with complex traits can be identified without going through the expensive gene expression measurement process in every individual. A relatively small set of gene expression and genotype data can be used to impute expression for a much larger set of phenotyped individuals from their SNP genotype data.

__Method__:
    The imputed expression can then be viewed as a linear model of genotypes with __weights based on the correlation between SNPs and gene expression__ in the training data, while accounting for linkage disequilibrium (LD) among SNPs. We then correlate the imputed gene expression with the trait to perform a transcriptome-wide association study (TWAS) and identify significant expression-trait associations.
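
As a rough illustration of the two steps described above, the following sketch imputes expression as a weighted sum of genotypes and then combines GWAS z-scores with the same weights, using the summary-statistic form `z = w'z_gwas / sqrt(w' LD w)` found in FUSION-style tests. The function names and toy numbers are assumptions made for illustration only; they are not part of this workflow.

```python
import math

def impute_expression(genotypes, weights):
    """Predicted expression for each individual: sum_j w_j * g_ij."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

def twas_z(weights, gwas_z, ld):
    """Summary-statistic TWAS z-score: w'z / sqrt(w' LD w)."""
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j]
        for i in range(len(weights))
        for j in range(len(weights))
    )
    return numerator / math.sqrt(variance)

# Toy data (hypothetical): 3 individuals x 2 SNPs, dosages coded 0/1/2.
genotypes = [[0, 1], [1, 2], [2, 0]]
weights = [0.5, -0.2]          # SNP weights learned from the expression panel
gwas_z = [3.0, 1.0]            # GWAS z-scores for the same two SNPs
ld = [[1.0, 0.3], [0.3, 1.0]]  # LD (correlation) matrix between the SNPs

predicted = impute_expression(genotypes, weights)
z_score = twas_z(weights, gwas_z, ld)
```

The denominator is what "accounting for LD" means in practice: correlated SNPs contribute less independent evidence than their individual z-scores would suggest.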
 
The weights are computed via various models: blup, bslmm, lasso, top1, and enet. BLUP (best linear unbiased predictor) and BSLMM (Bayesian sparse linear mixed model) are fitted using GEMMA, lasso using PLINK, and enet (elastic net) using the `cv.glmnet` function in R.

Before the weight calculation, the heritability of each gene is computed using GCTA; genes without significant heritability are screened out.

## Pre-requisites

We provide a container image `docker://gaow/twas` that contains all software needed to run the pipeline. If you would like to configure it by yourself, please make sure you install the following software before running this notebook:
- GCTA
- PLINK
- GEMMA
- The modified `FUSION.compute_weights.R` script can be found [in this github repo](https://github.com/cumc/neuro-twas/blob/master/Workflow/FUSION.compute_weights.R).
- The original `FUSION.assoc_test.R` script can be found [in the author's github repo](https://github.com/gusevlab/fusion_twas).

You need to make both `FUSION.compute_weights.R` and `FUSION.assoc_test.R` executable with the `Rscript` command. See [this line]() for an example.

# Input and Output
## Input
- `--gwas_sumstat` The GWAS summary statistics text file documenting the associations between SNPs and the disease. It must contain at least four columns: the SNP rsID, the effect allele, the other allele, and the Z-score describing the relationship between the SNP and the disease.
- `--genotype_list` An index text file with two columns: the chromosome and the corresponding PLINK bed file.
- `--molecular-pheno` The text file containing the table describing the molecular phenotype. It must have regions (genes) as rows and samples as columns.
- `--region_list` The text file with 4 columns specifying the #Chr, P0 (start position), P1 (end position) and names of the regions to analyze. The column names are not important, but the column order is. The name of the first column must start with a `#`. The region_list can be generated using another SoS pipeline, SOS_ROSMAP_gene_exp_processing.ipynb.
- `--window` The distance, in base pairs, by which each region is extended upstream of its start and downstream of its end position to define the cis window of the gene. If the gene expression data have only one position column, set the window to a large number such as 5E5.
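
To make the `--region_list` and `--window` conventions concrete, here is a minimal sketch of how a region list could be parsed and expanded into cis windows. It mirrors the workflow's rule of skipping empty lines and lines starting with `#`; the helper names and the example genes are hypothetical.

```python
def read_region_list(lines):
    """Parse region_list rows into (chrom, start, end, name) tuples.

    Empty lines and lines starting with '#' (the header) are skipped,
    mirroring how the workflow reads the file.
    """
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end, name = line.split()[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions

def cis_window(start, end, window):
    """Extend a region by `window` base pairs on each side."""
    return start - window, end + window

example = [
    "#Chr\tP0\tP1\tname",
    "1\t1000000\t1005000\tGENE_A",   # hypothetical gene with start and end
    "chr2\t750000\t750000\tGENE_B",  # hypothetical single-position region
]
regions = read_region_list(example)
windows = {name: cis_window(s, e, 500000) for _, s, e, name in regions}
```

For a single-position region such as `GENE_B`, the window alone determines how many SNPs are included, which is why a large value is suggested in that case.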





**FIXME: here is also some text I dug up later in your workflow code. Please explain it up-front. We would not expect users to read this notebook beyond the "Working example" section.**
This probably serves as a self-reminder of what each file does and is therefore not meant for users to see, so maybe move it back?

```
1. The gene expression phenotype, a three-column table for each gene, with the first two columns specifying the family ID and within-family ID of the samples. In the current case, where all samples are unrelated, the first two columns are simply the sample ID. The third column is the actual gene expression value.
2. The plink trio file for each specific gene, containing only the SNPs corresponding to the regions whose expression is recorded. In particular, the SNPs are filtered according to the genetic regions outlined by position +/- window.
```



## Output

- `uni_weight.RDS` an RDS file that serves as the input for the mixture pipeline.
- `susie.RData` an R object containing the SuSiE output for each region.
 

# Command interface 

In [2]:
!sos run twas_fusion.ipynb -h

usage: sos run twas_fusion.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  twas_fusion
  association_test

Global Workflow Options:
  --molecular-pheno VAL (as path, required)
                        Path to the input molecular phenotype data.
  --gwas-sumstat VAL (as path, required)
                        Path to GWAS summary statistics data (association
                        results between SNP and disease in a GWAS)
  --genotype-list VAL (as path, required)
                        An index text file with two columns of chromosome and
                        the corresponding PLINK bed file.
  --region-list VAL (as path, required)
                        An index text file 

# Working example
The MWE file is available at:
"https://www.synapse.org/#!Synapse:syn24179064/files/"

Running this MWE should take around 2 minutes. Pay extra attention to the gene_start and gene_end positions when using the following command on gene_exp files other than this MWE. Also, when too few or too many genes pass the heritability check, consider increasing or decreasing the `--window` option.

In [1]:
## Test pipeline with test data
## Switch back to absolute paths, otherwise there will be a file-not-found error in step 5
sos run susie-wgs-prior.ipynb susie \
  --molecular-pheno ./molecular_phenotype \
  --wd ./ \
  --genotype_list ./geno_dir \
  --region_list ./region_list \
  --region_name 1 \
  --data_start 5 \
  --window 500000 \
  --container /mnt/mfs/statgen/containers/twas_latest.sif 


ERROR: Failed to locate twas_fusion.ipynb.sos



Keyboard Interrupt


# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [5]:
[global]
# Path to the input molecular phenotype data.
parameter: molecular_pheno = path
# An index text file with two columns of chromosome and the corresponding PLINK bed file.
parameter: genotype_list = path
# An index text file with 4 columns specifying the chr, start, end and names of regions to analyze
parameter: region_list = path
# Path to the work directory of the weight computation: output weights and cache will be saved to this directory.
parameter: wd = path('./')
# Specify the directory to save fitted weights
parameter: weights_path = f'{wd:a}/WEIGHTS'
# Path to list of weights
parameter: weights_list = f'{weights_path}/{molecular_pheno:bn}.weights_list.txt'
# Path to store the output folder
parameter: output_path = f'{wd:a}/result'
# Specify the column in the molecular_pheno file that contains the name of the region of interest
parameter: region_name = int
# Specify the column in the molecular_pheno file where the actual data start
parameter: data_start = int
# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in base pairs (it is added directly to the --from-bp/--to-bp bounds passed to PLINK)
parameter: window = 50000
# Specify the number of jobs per run.
parameter: job_size = 2

# Container option for software to run the analysis: docker or singularity
parameter: container = 'gaow/twas'

# Proportion of samples set aside for testing; set to zero if no CV is needed.
parameter: testing_prop = 0.2
# Number of training & testing samples used
parameter: cv_times = 2

# parameters for the susie pipelines.
parameter: causal_variables_L = 10
parameter: scaled_prior_variance = 0.1

# Get regions of interest to focus on.
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]

geno_inventory = dict([x.strip().split() for x in open(genotype_list).readlines() if x.strip() and not x.strip().startswith('#')])

import os
def get_genotype_file(chrom, genotype_list, geno_inventory):
    chrom = f'{chrom}'
    if chrom.startswith('chr'):
        chrom = chrom[3:]
    if chrom not in geno_inventory:
        geno_file = f'{chrom}'
    else:
        geno_file = geno_inventory[chrom]
    if not os.path.isfile(geno_file):
        # relative path
        if not os.path.isfile(f'{genotype_list:ad}/' + geno_file):
            raise ValueError(f"Cannot find genotype file {geno_file}")
        else:
            geno_file = f'{genotype_list:ad}/' + geno_file
    return path(geno_file)
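
The lookup order used by `get_genotype_file` can be easier to follow outside SoS. The sketch below is a plain-Python analogue under the assumption that `{genotype_list:ad}` means "the absolute directory containing the index file"; the name `resolve_genotype_file` and the temporary files are hypothetical and exist only for this example.

```python
import os
import tempfile

def resolve_genotype_file(chrom, genotype_list, geno_inventory):
    """Plain-Python analogue of get_genotype_file (no SoS path objects)."""
    chrom = str(chrom)
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    # If the chromosome is not indexed, treat the argument itself as a file name.
    geno_file = geno_inventory.get(chrom, chrom)
    if not os.path.isfile(geno_file):
        # Try again relative to the directory holding the index file.
        list_dir = os.path.dirname(os.path.abspath(genotype_list))
        candidate = os.path.join(list_dir, geno_file)
        if not os.path.isfile(candidate):
            raise ValueError(f"Cannot find genotype file {geno_file}")
        geno_file = candidate
    return geno_file

# Hypothetical layout: an index file stored next to one PLINK bed file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "chr1.bed"), "w").close()
    index = os.path.join(tmp, "geno_list.txt")
    with open(index, "w") as fh:
        fh.write("#chr\tpath\n1\tchr1.bed\n")
    with open(index) as fh:
        inventory = dict(
            line.split() for line in fh
            if line.strip() and not line.startswith("#")
        )
    resolved_name = os.path.basename(resolve_genotype_file("chr1", index, inventory))
    try:
        resolve_genotype_file("chr9", index, inventory)
        missing_raises = False
    except ValueError:
        missing_raises = True
```

Because relative entries are resolved against the index file's directory, the genotype list can be moved together with its bed files without editing the paths inside it.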

## Partition of the molecular phenotype for each gene
This step extracts the molecular phenotype for each gene and transposes it into the format needed in the follow-up analysis.

In [11]:
[susie_1,susie_cv_1]
input: molecular_pheno, for_each = "regions"
output: f'{wd:a}/cache/{_input:bn}.{_regions[3]}.exp',
        f'{wd:a}/cache/{_input:bn}.{_regions[3]}.pheno'
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'
R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    # Get the line number for the region in the file
    line_num = system("awk '($$[region_name]==\"$[_regions[3]]\") {print NR}' $[_input]", intern=T)
    if (length(line_num) == 0){
      stop( "Cannot find $[_regions[3]] in column $[region_name]  $[_input]")}
    yi <- data.table::fread(file = $[_input:r], skip = as.integer(line_num) - 1, nrows = 1)
    samplenames_yi <- data.table::fread(file = $[_input:r], skip = 0, nrows = 1)
    colnames(yi) <- colnames(samplenames_yi)
    readr::write_tsv(yi, path = "$[_output[0]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
    yi <- as.data.frame(yi[, $[data_start]:ncol(yi), drop = FALSE])
    yj <- rbind(colnames(yi),colnames(yi),yi)
    readr::write_tsv(as.data.frame(t(yj)), path = "$[_output[1]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
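
The reshaping done in this step can be summarised with a small sketch: one wide row of the phenotype table becomes a three-column table of family ID, within-family ID and expression value. The function name and the example header are hypothetical; `data_start` is treated as 1-based, as in the R code.

```python
def to_plink_pheno(header, row, data_start):
    """Turn one wide phenotype row into (FID, IID, value) records.

    `data_start` is the 1-based column where sample values begin,
    matching the workflow's --data_start option.
    """
    samples = header[data_start - 1:]
    values = row[data_start - 1:]
    # Unrelated samples: family ID and within-family ID are both the sample ID.
    return [(s, s, v) for s, v in zip(samples, values)]

header = ["#chr", "start", "end", "gene_id", "S1", "S2", "S3"]  # hypothetical
row = ["1", "1000", "2000", "GENE_A", "0.12", "NA", "-1.3"]
pheno = to_plink_pheno(header, row, data_start=5)
```

With `data_start=5`, the first four annotation columns are dropped and only the per-sample values are carried into the phenotype table.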

## Construction of Plink trio for each gene
This step constructs the plink file for each gene based on the output of previous steps. Specifically it:

1. Selects only the SNPs within the start and end position of the corresponding region (gene)
2. Replaces the Phenotype value (last column) of the .fam based on the input


In [225]:
[susie_2,susie_cv_2]
input: group_by = 2, group_with = 'regions'
output: f'{_input[0]:n}.bed',
        f'{_input[0]:n}.bim',
        f'{_input[0]:n}.fam'

# Look up the genotype file by chromosome (first column of region_list), since genotype_list is indexed by chromosome
geno_file = get_genotype_file(_regions[0],genotype_list,geno_inventory)

parameter: extract_snp = f'{geno_file:an}.bim'
parameter: exclude_snp = "./."
parameter: keep_sample = f'{_input[1]}'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, volumes = [f'{geno_file:ad}:{geno_file:ad}']
    ##### Get the locus genotypes for $[_regions[3]]
    plink --bfile $[geno_file:an] \
    --pheno $[_input[1]] \
    --make-bed \
    --out $[_output[0]:n] \
    --chr $[_regions[0]] \
    --from-bp $[int(_regions[1]) - window ] \
    --to-bp $[int(_regions[2]) + window ] \
    --keep $[keep_sample] \
    --extract $[extract_snp]
    --exclude $[exclude_snp]
    --allow-no-sex || true
    touch $[_output]
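
The SNP selection performed by `--chr`, `--from-bp` and `--to-bp` amounts to a positional filter on the `.bim` file. A minimal sketch follows; treating both bounds as inclusive is an assumption here, and the variant records are hypothetical.

```python
def snps_in_window(bim_lines, chrom, start, end, window):
    """Return IDs of SNPs on `chrom` within [start - window, end + window]."""
    lo, hi = start - window, end + window
    selected = []
    for line in bim_lines:
        # .bim columns: chromosome, variant ID, cM, bp position, allele 1, allele 2
        snp_chrom, snp_id, _, bp, _, _ = line.split()
        if snp_chrom == str(chrom) and lo <= int(bp) <= hi:
            selected.append(snp_id)
    return selected

bim = [
    "1 rs1 0 400000 A G",    # hypothetical variants
    "1 rs2 0 900000 C T",
    "1 rs3 0 1600000 G A",
    "2 rs4 0 900000 T C",
]
kept = snps_in_window(bim, chrom=1, start=1000000, end=1005000, window=500000)
```

Running such a filter on a real `.bim` file before launching the workflow is a quick way to check whether a chosen `--window` leaves a reasonable number of SNPs per gene.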

## Conducting univariate tests for all genes and saving the summary statistics
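
Before fitting, the R cell for this step drops variants with a missing rate above 0.1, a MAF below 0.01, or zero variance, and mean-imputes the remaining missing calls. The same logic can be sketched in plain Python as follows; missing genotypes are represented by `None`, and the variant names and dosages are hypothetical.

```python
def compute_maf(geno):
    """Minor allele frequency from 0/1/2 dosages, ignoring missing (None)."""
    observed = [g for g in geno if g is not None]
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)

def compute_missing(geno):
    """Fraction of missing genotype calls."""
    return sum(g is None for g in geno) / len(geno)

def keep_variant(geno, missing_rate_thresh=0.1, maf_thresh=0.01):
    """Apply the three filters: missingness, MAF, and zero variance."""
    if compute_missing(geno) > missing_rate_thresh:
        return False
    if compute_maf(geno) < maf_thresh:
        return False
    observed = {g for g in geno if g is not None}
    return len(observed) > 1

def mean_impute(geno):
    """Replace missing calls with the mean of the observed dosages."""
    observed = [g for g in geno if g is not None]
    mean = sum(observed) / len(observed)
    return [mean if g is None else g for g in geno]

# Hypothetical variants, each a list of dosages across 10 samples.
variants = {
    "common": [0, 1, 2, 1, 0, 1, None, 2, 1, 0],
    "monomorphic": [0] * 10,
    "mostly_missing": [None, None, 1, 0, 2, 1, 0, 1, 2, 1],
}
kept = {name: mean_impute(g) for name, g in variants.items() if keep_variant(g)}
```

Only the first variant survives: the second has no variation and the third exceeds the missing-rate threshold.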

In [None]:
#Susie test
[susie_3,susie_cv_3]
input:  group_by = 3, group_with = 'regions'
output: f'{wd:a}/susie/{_input[0]:bn}.susie.model.RData',
        f'{wd:a}/susie/{_input[0]:bn}.uni_weight.rds'

import os
skip_if(os.path.getsize(f'{_input[0]}') == 0)
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = "6G" , tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container,volumes = [f'{wd:a}:{wd:a}']
    library("dplyr")
    library("tibble")
    library("susieR")
    library("plink2R")
    library("readr")
    library("modelr")
    library("purrr")
    library("abind")

    # Define functions
    ###Functions to compute MAF and missing genotype rate
    compute_maf <- function(geno){
      f <- mean(geno,na.rm = TRUE)/2
      return(min(f, 1-f))
    }
    
    compute_missing <- function(geno){
      miss <- sum(is.na(geno))/length(geno)
      return(miss)
    }
    
    mean_impute <- function(geno){
      f <- apply(geno, 2, function(x) mean(x,na.rm = TRUE))
      for (i in 1:length(f)) geno[,i][which(is.na(geno[,i]))] <- f[i]
      return(geno)
    }
    
    is_zero_variance <- function(x) {
      if (length(unique(x%>%na.omit()))==1) return(T)
      else return(F)
    }
    ### Filter X matrix
    filter_X <- function(X, missing_rate_thresh, maf_thresh) {
      rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, compute_maf) < maf_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, is_zero_variance))
      if (length(rm_col)) X <- X[, -rm_col]
      return(mean_impute(X))}
      
    ###Function to optionally impute missing X with column means, then center and scale X
    impute_and_transform = function(genos,impute = TRUE){
      tmp = genos
      if(impute == TRUE){
        for(i in 1:ncol(tmp)){
          tmp[,i]=coalesce(tmp[,i],mean(tmp[,i]%>%na.omit()))%>%scale()}
        return(tmp)
      } else {
        for(i in 1:ncol(tmp)){
          tmp[,i]=tmp[,i]%>%scale()}
        return(tmp)}}
        
    ###Function to compute the univariate regression estimates and their standard errors
    
    mm_regression = function(X, Y, Z=NULL,center=TRUE,scale=TRUE) {
    ## HS: Make sure X and Y are matrices as well; otherwise ncol(Y) gives an error
    X = as.matrix(X)
    Y = as.matrix(Y)
    if (!is.null(Z)) {
    Z = as.matrix(Z)
    }
    if(any(is.na(Y))){
    reg = lapply(seq_len(ncol(Y)), function (i) simplify2array(susieR:::univariate_regression(X, Y[,i], Z, center, scale)))
    reg = do.call(abind, c(reg, list(along=0)))
    # return array: out[1,,] is betahat, out[2,,] is shat
    out = aperm(reg, c(3,2,1))
    #HS Force dimension for the matrix slice.
    out = list(bhat = as.matrix(out[1,,]), sbhat=as.matrix(out[2,,]))
    }else{
    out = susieR:::univariate_regression(X, Y, Z, center = F, scale = F)
    #HS The original output names the estimates betahat and sebetahat rather than bhat and sbhat; add the latter here.
    out$bhat = as.matrix(out$betahat)
    out$sbhat = as.matrix(out$sebetahat)
    }
    if (!is.null(colnames(X))) {
    rownames(out$bhat) = colnames(X)
    rownames(out$sbhat) = colnames(X)
    }
    if (!is.null(colnames(Y))) {
    colnames(out$bhat) = colnames(Y)
    colnames(out$sbhat) = colnames(Y)
    }
    # `out` is a list of bhat and sbhat
    return(out)
    }

    # Load data and transform
    genos = read_plink("$[_input[0]:n]")
  
    # Filter X by 0.1 missing rate and 0.01 MAF, and then transform X
    
    X_ftr = filter_X(genos$bed,0.1,0.01)
    X = impute_and_transform(X_ftr,impute = FALSE)
    
    Y = genos$fam%>%as_tibble()%>%mutate(name = paste(V1,":",V2,sep = ""))
    
    # Make sure X and Y have the same order
    Y = Y%>%arrange(match(name,rownames(X)))%>%select(V6)
    
    
    # Center and scale Y 
    
    Y = impute_and_transform(Y, impute = FALSE )
    Y = Y$V6
    
    # Susie with full samples
    full_model = susie(X, Y,
                  L = $[causal_variables_L],
                  estimate_residual_variance = TRUE, 
                  estimate_prior_variance = FALSE,
                  scaled_prior_variance = $[scaled_prior_variance])
                  

    # Get sumstat from the univariate regression
    
    uni = mm_regression(X,Y,center=FALSE,scale=FALSE)
    
    # Save the sumstat objects
    
    uni%>%saveRDS("$[_output[1]]")
    
    # Save the model for future use
    full_model$X = X
    full_model$Y = Y
    save(full_model, file="$[_output[0]]")
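
The summary statistics saved in `uni_weight.rds` are per-SNP effect estimates (`bhat`) and standard errors (`sbhat`). As a rough guide to what these numbers are, the sketch below fits an ordinary least-squares regression of expression on each SNP separately, with an intercept. It is a plain-Python illustration under that assumption, not the `susieR` implementation, and the toy data are hypothetical.

```python
import math

def univariate_regression(columns, y):
    """Regress y on each SNP separately (with intercept).

    Returns two lists: effect estimates (bhat) and standard errors (sbhat).
    """
    n = len(y)
    y_mean = sum(y) / n
    bhat, sbhat = [], []
    for x in columns:
        x_mean = sum(x) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = y_mean - slope * x_mean
        rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
        bhat.append(slope)
        # Residual variance uses n - 2 degrees of freedom (intercept + slope).
        sbhat.append(math.sqrt(rss / (n - 2) / sxx))
    return bhat, sbhat

# Toy data (hypothetical): 5 samples, 2 SNPs stored column-wise.
snp_columns = [[0, 1, 2, 1, 0], [2, 2, 1, 0, 1]]
expression = [0.1, 1.1, 2.0, 0.9, -0.1]
bhat, sbhat = univariate_regression(snp_columns, expression)
```

Each SNP is fitted independently of the others, so unlike the SuSiE fit these estimates do not account for LD between nearby variants.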