# TWAS multivariate susie

This notebook implements a TWAS analysis workflow using multivariate susie.

## Aim

TBD

## Overview (TBD)

__Objective__: 
    Compute the association between gene expression and SNPs for TWAS analysis.

__Background__:
    SNPs can modulate functional phenotypes both directly and by modulating the expression levels of genes. 
Therefore, integrating expression measurements with larger-scale GWAS summary association statistics helps identify genes associated with the complex traits of interest. 

__Significance__:
    With this method, candidate genes whose expression level is significantly associated with complex traits can be identified without measuring gene expression in every individual, which is expensive. A relatively small reference set with both gene expression and genotype data can be used to impute expression for a much larger set of phenotyped individuals from their SNP genotype data. 

__Method__:
    The imputed expression can be viewed as a linear model of genotypes with __weights based on the correlation between SNPs and gene expression__ in the training data, while accounting for linkage disequilibrium (LD) among SNPs. We then correlate the imputed gene expression with the trait to perform a transcriptome-wide association study (TWAS) and identify significant expression-trait associations. 
 
The weights are computed via several models: blup, bslmm, lasso, top1, and enet. BLUP (best linear unbiased predictor) and BSLMM (Bayesian sparse linear mixed model) are fitted using GEMMA, lasso using PLINK, and enet (elastic net) using the `cv.glmnet` function in R.

Before the weights are calculated, the heritability of each gene is computed using GCTA; genes with non-significant heritability are screened out.
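
The idea in the Method paragraph can be illustrated with a small sketch. It assumes the summary-statistic form commonly used by FUSION-style TWAS, where the gene-level statistic is $w^T z / \sqrt{w^T \Sigma w}$ for SNP weights $w$, GWAS Z-scores $z$, and LD matrix $\Sigma$. The function names `impute_expression` and `twas_z` are hypothetical and are not part of this pipeline.

```python
import math


def impute_expression(genotypes, weights):
    """Impute expression as a linear combination of genotypes.

    genotypes: list of per-individual genotype rows (one dosage per SNP)
    weights:   per-SNP expression weights learned from the training data
    """
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]


def twas_z(weights, gwas_z, ld):
    """Combine per-SNP GWAS Z-scores into one gene-level Z-score.

    ld is the SNP-by-SNP correlation (LD) matrix as nested lists,
    in the same SNP order as weights and gwas_z.
    """
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0:
        raise ValueError("weights and LD matrix give a non-positive variance")
    return numerator / math.sqrt(variance)


# Toy values, chosen only for illustration
print(impute_expression([[0, 1], [2, 2]], [0.5, 0.5]))
print(twas_z([0.5, 0.5], [2.0, 2.0], [[1.0, 1.0], [1.0, 1.0]]))
```

The LD matrix enters only through the denominator, which is how correlated SNPs are kept from being counted as independent evidence.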

## Pre-requisites

We provide a container image `docker://gaow/twas` that contains all software needed to run the pipeline. If you would like to configure it by yourself, please make sure you install the following software before running this notebook:
- PLINK
- R package mashr
- R package mmbr
- Output from the univariate analysis pipeline: twas_fusion_susie.ipynb

# Input and Output(TBD)
## Input
- `--gwas_sumstat` The GWAS summary statistics text file documenting the associations between SNPs and the disease. It must contain at least four columns: the SNP rsID, the effect allele, the other allele, and the Z-score describing the association between the SNP and the disease. 
- `--genotype_list` An index text file with two columns: the chromosome and the corresponding PLINK bed file.
- `--molecular-pheno` The text file containing the molecular phenotype table, with regions (genes) as rows and samples as columns.
- `--region_list` The text file with 4 columns specifying the #Chr, P0 (start position), P1 (end position), and names of the regions to analyze. The column names are not important, but the column order is. The name of the first column must start with a `#`. The region_list can be generated with another SoS pipeline, SOS_ROSMAP_gene_exp_processing.ipynb.
- `--window` The distance by which the cis region extends beyond the specified start and end sites of the gene. If the gene expression file has only one position column, set the window to a large number such as 5E5.
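
As a rough illustration of how `--window` interacts with the start and end columns of the region list, the sketch below expands a region by the window on both sides and keeps the SNP positions that fall inside. The helper names and the clamping of the lower bound at 0 are assumptions made for this example; the pipeline itself performs this filtering with PLINK.

```python
def cis_window(start, end, window):
    """Return (lower, upper) bounds of the cis region around a gene.

    The lower bound is clamped at 0 so that genes near the start of a
    chromosome do not produce negative coordinates (an assumption here).
    """
    lower = max(0, start - window)
    upper = end + window
    return lower, upper


def snps_in_window(snp_positions, start, end, window):
    """Keep the SNP positions lying within [start - window, end + window]."""
    lower, upper = cis_window(start, end, window)
    return [pos for pos in snp_positions if lower <= pos <= upper]


# Toy coordinates with a 5E5 window
print(cis_window(1_000_000, 1_010_000, 500_000))
print(snps_in_window([400_000, 600_000, 1_600_000], 1_000_000, 1_010_000, 500_000))
```

When the phenotype file has a single position column, start and end coincide, so the window alone determines the size of the region.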





**FIXME: here is also some text I dug up later in your workflow code. Please explain it up front. We would not expect users to read this notebook beyond the "Working example" section.**
This probably serves as a self-reminder of what each file does and is therefore not meant for the user to see, so maybe move it back?

```
1. The gene expression phenotype: a three-column table for each gene, with the first two columns specifying the family ID and within-family ID of the samples. In the current case, where all samples are unrelated, the first two columns are both simply the sample ID. The third column is the actual gene expression value.
2. The PLINK trio files for each specific gene, containing only the SNPs in the region whose expression is recorded. In particular, the SNPs are filtered according to the genomic region defined by position +/- window.
```



## Output

- `.wgt.Rdat` The computed weight data.
- `.weight_list.txt` The index text file recording information about each `.wgt.Rdat` file, including the filename, the corresponding region ID, chromosome, start and end positions, heritability, and the SE and P-value of the heritability.
- `.dat` The TWAS association results for the genes; a detailed description of this output is given at http://gusevlab.org/projects/fusion/


 








# Command interface (TBD)

In [None]:
!sos run twas_fusion.ipynb -h

# Working example (TBD)
A minimal working example (MWE) dataset can be downloaded from the private repo:
https://github.com/cumc/neuro-twas/blob/master/TWAS_pipeline_MWE%202.zip
The genotype files can be downloaded from the following link:
https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2

**FIXME: please upload the data to synapse.org and take it out of github. On github we don't store large datasets. Please ask me about account information for synapse.org**

This MWE should take around 2 minutes to run. Pay extra attention to the gene_start and gene_end positions when running the following command on gene_exp files other than this MWE. Also, when too few or too many genes pass the heritability check, consider increasing or decreasing the `--window` option. 

In [1]:
## Test pipeline with test data
## Use an absolute path; otherwise there will be a file-not-found error in step 5
sos dryrun /Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/freshcopy/neuro-twas/Workflow/mv_susie.ipynb mv_susie \
  --molecular_pheno_dir "molecular_phenotype_list" \
  --region_list region_list \
  --wd ./ \
  --name_prefix "geneTpmResidualsAgeGenderAdj_rename"


ERROR: Failed to locate twas_fusion.ipynb.sos



# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [5]:
[global]
# Path to a text file listing the molecular phenotype directories to be analyzed; each directory must contain a cache folder.
parameter: molecular_pheno_dir = path
# List of regions shared across all three directories
parameter: region_list = path
# Path to the working directory of this pipeline, where the output will be stored.
parameter: wd = path
# Path to the output folder
parameter: output_path = f'{wd:a}/result'
# Number of jobs per run.
parameter: job_size = 2
# Container image providing the software for the analysis (run with docker or singularity)
parameter: container = 'gaow/twas'
# Name prefix of the molecular phenotype files
parameter: name_prefix = "chr"

# Get regions of interest to focus on.
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
molecular_pheno = [x.strip().split() for x in open(molecular_pheno_dir).readlines() if x.strip() and not x.strip().startswith('#')]
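
The two list comprehensions above skip blank lines and lines starting with `#`, then split each remaining line on whitespace. The sketch below reproduces that logic on an in-memory example so the resulting structure is easy to see. The helper name `read_table_rows` and the gene names are hypothetical.

```python
import io


def read_table_rows(handle):
    """Split each non-empty, non-comment line of a text table on whitespace."""
    return [
        line.strip().split()
        for line in handle
        if line.strip() and not line.strip().startswith("#")
    ]


# A region list with a '#'-prefixed header and one blank line
example = io.StringIO(
    "#Chr\tP0\tP1\tname\n"
    "1\t1000\t2000\tGENE_A\n"
    "\n"
    "2\t500\t900\tGENE_B\n"
)
regions = read_table_rows(example)
print(regions)
```

Every field is kept as a string, so positions need to be converted explicitly before any arithmetic. This is also why the header of the first column has to start with `#`: otherwise the header row would be treated as a region.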

## Merge of X (plink) and Y (R)
Create a merge list, then merge the files based on that list.

In [None]:
[mv_susie_1]

input:  molecular_pheno_dir, for_each = "regions"
output: f'{wd:a}/cache/{name_prefix}.{_regions[0]}.merged_list',
        f'{wd:a}/cache/{name_prefix}.{_regions[0]}.merged.bed',
        f'{wd:a}/cache/{name_prefix}.{_regions[0]}.merged.exp'
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'        
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    # Create merge list; the first line overwrites the file so that reruns do not accumulate duplicate entries
    echo '$[next(iter(molecular_pheno[0]))]/cache/$[name_prefix].$[_regions[0]]' > $[_output[0]]
    echo '$[next(iter(molecular_pheno[1]))]/cache/$[name_prefix].$[_regions[0]]' >> $[_output[0]]
    echo '$[next(iter(molecular_pheno[2]))]/cache/$[name_prefix].$[_regions[0]]' >> $[_output[0]]
    
    # create the merged output X
    plink --bfile '$[next(iter(molecular_pheno[0]))]/cache/$[name_prefix].$[_regions[0]]'\
          --merge-list $[_output[0]] \
          --mac 1 \
          --make-bed \
          --out $[_output[1]:n] \
          --allow-no-sex

R: expand = "$[ ]", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("plink2R")
    genos_1_$[_regions[0]] = read_plink("$[next(iter(molecular_pheno[0]))]/cache/$[name_prefix].$[_regions[0]]")
    genos_2_$[_regions[0]] = read_plink("$[next(iter(molecular_pheno[1]))]/cache/$[name_prefix].$[_regions[0]]")
    genos_3_$[_regions[0]] = read_plink("$[next(iter(molecular_pheno[2]))]/cache/$[name_prefix].$[_regions[0]]")
    genos_1_$[_regions[0]]_fam = genos_1_$[_regions[0]]$fam%>%as_tibble()%>%mutate(name = paste(V1,":",V2,sep = ""))%>%select(name,V6)
    genos_2_$[_regions[0]]_fam = genos_2_$[_regions[0]]$fam%>%as_tibble()%>%mutate(name = paste(V1,":",V2,sep = ""))%>%select(name,V6)
    genos_3_$[_regions[0]]_fam = genos_3_$[_regions[0]]$fam%>%as_tibble()%>%mutate(name = paste(V1,":",V2,sep = ""))%>%select(name,V6)
    genos_join_phe_$[_regions[0]] = full_join(genos_1_$[_regions[0]]_fam,genos_2_$[_regions[0]]_fam,by = "name")
    genos_join_phe_$[_regions[0]] = full_join(genos_join_phe_$[_regions[0]],genos_3_$[_regions[0]]_fam,by = "name")
    genos_join_phe_$[_regions[0]]%>%readr::write_delim("$[_output[2]]",delim = "\t")
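
The R block above keys each phenotype table by `FID:IID` and combines the three tables with full joins, so a sample missing from one table still appears in the result with an empty value. A minimal sketch of the same idea follows; the function names and sample IDs are hypothetical and only illustrate the join semantics.

```python
def sample_key(fid, iid):
    """Build the FID:IID key used to match samples across tables."""
    return f"{fid}:{iid}"


def full_join(tables):
    """Full outer join of several {sample name: value} tables.

    Returns one row per sample seen in any table, in order of first
    appearance, with None where a table has no value for that sample.
    """
    names = []
    seen = set()
    for table in tables:
        for name in table:
            if name not in seen:
                seen.add(name)
                names.append(name)
    return [[name] + [table.get(name) for table in tables] for name in names]


pheno_1 = {sample_key("F1", "S1"): 1.0, sample_key("F2", "S2"): 2.0}
pheno_2 = {sample_key("F2", "S2"): 0.5, sample_key("F3", "S3"): 0.7}
for row in full_join([pheno_1, pheno_2]):
    print(row)
```

The `None` entries correspond to the missing values that are later imputed before the multivariate model is fitted.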

## Perform MV SuSiE
This step takes the merged files from the previous step and fits the multivariate SuSiE model.

In [225]:
[mv_susie_2]
input: group_by = 3, group_with = 'regions'
output:  f'{wd:a}/result/{_input[0]:bn}.mv_susie.model.RData',
         f'{wd:a}/result/{_input[0]:bn}.transformed_XY.RData',
         f'{wd:a}/result/{_input[0]:bn}.mv_wgt.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("plink2R")
    library("mashr")
    library("mmbr")  
    library("flashier")
    # Define functions
    ###Functions to compute MAF and missing genotype rate
    compute_maf <- function(geno){
      f <- mean(geno,na.rm = TRUE)/2
      return(min(f, 1-f))
    }
    
    compute_missing <- function(geno){
      miss <- sum(is.na(geno))/length(geno)
      return(miss)
    }
    
    mean_impute <- function(geno){
      f <- apply(geno, 2, function(x) mean(x,na.rm = TRUE))
      for (i in 1:length(f)) geno[,i][which(is.na(geno[,i]))] <- f[i]
      return(geno)
    }
    
    is_zero_variance <- function(x) {
      if (length(unique(x))==1) return(T)
      else return(F)
    }
    ### Filter X matrix
    filter_X <- function(X, missing_rate_thresh, maf_thresh) {
      # drop = FALSE keeps X a matrix even when a single column remains
      rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
      if (length(rm_col)) X <- X[, -rm_col, drop = FALSE]
      rm_col <- which(apply(X, 2, compute_maf) < maf_thresh)
      if (length(rm_col)) X <- X[, -rm_col, drop = FALSE]
      rm_col <- which(apply(X, 2, is_zero_variance))
      if (length(rm_col)) X <- X[, -rm_col, drop = FALSE]
      return(mean_impute(X))
    }
    ###Function to calculate the covariance matrix of Y via flash
    compute_cov_flash <- function(Y, miss=NULL){
      if(is.null(miss)){
        fl <- flashier::flash(Y, var.type = 2, prior.family = c(flashier::prior.normal(), flashier::prior.normal.scale.mix()), backfit = TRUE, verbose.lvl=0)
      } else {
        fl <- flashier::flash(Y[-miss, ], var.type = 2, prior.family = c(flashier::prior.normal(), flashier::prior.normal.scale.mix()), backfit = TRUE, verbose.lvl=0)
      }  
      if(fl$n.factors==0){
        covar <- diag(fl$residuals.sd^2)
      } else {
        fsd <- sapply(fl$fitted.g[[1]], '[[', "sd")
        covar <- diag(fl$residuals.sd^2) + crossprod(t(fl$flash.fit$EF[[2]]) * fsd)
      }
      return(covar)
    }
    ###Function to impute the missing values with column means and then center and scale each column
    impute_and_transform = function(genos){
      tmp = genos
      for(i in 1:ncol(tmp)){
        tmp[,i] = coalesce(tmp[,i],mean(tmp[,i]%>%na.omit()))%>%scale()
      }
      return(tmp)
    }
  
    # Load X data
    X_$[_regions[0]]_raw = read_plink("$[_input[1]:n]")$bed
    # Filter X: drop SNPs with missing rate > 0.1 or MAF < 0.01
    X_$[_regions[0]]_ftr = filter_X(X_$[_regions[0]]_raw,0.1,0.01)
    X_$[_regions[0]] = impute_and_transform(X_$[_regions[0]]_ftr)
    # Load Y data
    Y_$[_regions[0]] = read_delim("$[_input[2]]",delim = "\t")
    # Reorder Y based on X
    Y_$[_regions[0]] = Y_$[_regions[0]]%>%arrange(match(name,rownames(X_$[_regions[0]])))%>%select(-name)%>%as.matrix()
    # Compute the Cov matrix for Y via flashier
    Y_$[_regions[0]]_cov = Y_$[_regions[0]]%>%compute_cov_flash()
    # Impute the missing Y by mean and scale it.
    Y_$[_regions[0]] = Y_$[_regions[0]]%>%impute_and_transform()
    # Get prior
    prior_covar <- create_mash_prior(sample_data = list(X=X_$[_regions[0]],Y=Y_$[_regions[0]], residual_variance= Y_$[_regions[0]]_cov), max_mixture_len=-1)
    # Mv_susie
    # Residual variance: assumed here to be the flash-based covariance of Y computed above
    m_$[_regions[0]] <- msusie(X_$[_regions[0]], 
                Y_$[_regions[0]], 
                L=10, 
                prior_variance=prior_covar,
                residual_variance = Y_$[_regions[0]]_cov,precompute_covariances = TRUE)
    # Add an hsq element to the msusie object
    m_$[_regions[0]]$hsq = var(predict(m_$[_regions[0]]))/var(Y_$[_regions[0]])
  
    #Output: hsq estimated
    save(m_$[_regions[0]],file = "$[_output[0]]")
    #Output: scaled data
    scaled_$[_regions[0]] = list(X_$[_regions[0]],Y_$[_regions[0]])
    save(scaled_$[_regions[0]],file = "$[_output[1]]")
    #Output: Weight (converted to a data frame so that write_delim accepts it)
    m_$[_regions[0]]$coef%>%as.data.frame()%>%write_delim("$[_output[2]]",delim = "\t")
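
The genotype filtering in this step (missing rate, minor allele frequency, zero variance, then mean imputation) can be summarized with a small sketch. It mirrors the R helpers above in spirit only: missing genotypes are represented as `None`, the column names are hypothetical, and the zero-variance check here ignores missing values, which differs slightly from `is_zero_variance`.

```python
def missing_rate(genotypes):
    """Fraction of missing (None) genotype calls for one SNP."""
    return sum(g is None for g in genotypes) / len(genotypes)


def maf(genotypes):
    """Minor allele frequency from 0/1/2 dosages, ignoring missing calls."""
    observed = [g for g in genotypes if g is not None]
    freq = sum(observed) / len(observed) / 2
    return min(freq, 1 - freq)


def mean_impute(genotypes):
    """Replace missing calls with the mean of the observed dosages."""
    observed = [g for g in genotypes if g is not None]
    mean = sum(observed) / len(observed)
    return [mean if g is None else g for g in genotypes]


def filter_columns(columns, missing_thresh, maf_thresh):
    """Keep SNPs passing all filters and mean-impute what remains.

    Assumes missing_thresh < 1, so fully missing SNPs are dropped
    before their allele frequency is computed.
    """
    kept = {}
    for snp, genotypes in columns.items():
        if missing_rate(genotypes) > missing_thresh:
            continue
        if maf(genotypes) < maf_thresh:
            continue
        if len({g for g in genotypes if g is not None}) == 1:
            continue
        kept[snp] = mean_impute(genotypes)
    return kept


columns = {
    "snp_a": [0, 1, 2, None],           # passes all filters
    "snp_b": [0, 0, 0, 0],              # monomorphic, MAF = 0
    "snp_c": [None, None, None, 1],     # too many missing calls
}
print(filter_columns(columns, missing_thresh=0.3, maf_thresh=0.01))
```

Filtering before imputation matters: imputing first would hide the missingness that the first filter is meant to detect.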

## Merging all the RData files
This step merges the output from the previous step.

In [225]:
[mv_susie_3]
input: group_by = "all"
output:  f'{wd:a}/mv.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '6G', tags = f'{step_name}_{_output:bn}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load a template
    region = read_delim("$[region_list]",delim ="\t")%>%select(ID = `#region` )
    # get the path
    dir = "$[_input[0]:d]/"
    pre = "$[name_prefix]"
    sur = ".mv_susie.model.RData"
    region = region%>%mutate(path = map(ID, ~paste(collapse = "", c(dir,pre,".",.x,sur))))
    # Load the data
    output = region%>%mutate(env = map(path,~attach(.x)),
                            tb_name = map_chr(ID,~paste(collapse = "_", c("m",.x))),
                             model = map2(env,tb_name , ~get(.y,env = .x)))
    # Save the combined output
    save(output,file = "$[_output]")
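
The merging step relies on a naming convention: each region's model is stored in `<result dir>/<name_prefix>.<region ID>.mv_susie.model.RData` and the object inside is named `m_<region ID>`. The sketch below only restates that convention so it can be checked quickly; the function names, the directory, and the region ID are hypothetical.

```python
from pathlib import PurePosixPath


def model_path(result_dir, name_prefix, region_id):
    """Path of the per-region model file written by the previous step."""
    filename = f"{name_prefix}.{region_id}.mv_susie.model.RData"
    return str(PurePosixPath(result_dir) / filename)


def model_object_name(region_id):
    """Name of the model object stored inside the per-region file."""
    return f"m_{region_id}"


print(model_path("/work/result", "chr", "GENE_A"))
print(model_object_name("GENE_A"))
```

If the region IDs in the region list do not match the IDs used when the models were saved, the lookup of `m_<region ID>` will fail, so both steps need to read the same ID column.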


