# TWAS multivariate susie
A ready-to-use pipeline that conducts multivariate SuSiE analysis based on the output of the univariate SuSiE pipeline.

## Aim

The aim of this workflow is to estimate the association between genotype and various molecular phenotypes.

## Pre-requisites

We provide a container image `docker://gaow/twas` that contains all the software needed to run the pipeline. If you would like to configure the environment yourself, please make sure the following are available before running this notebook:
- tidyverse
- PLINK
- R package mashr
- R package mvsusieR (formerly named mmbr)
- R packages plink2R and flashier, which are loaded by the workflow steps
- Output from the following univariate analysis pipeline: twas_fusion_susie.ipynb

# Input and Output
## Input

This workflow is designed to run on the output of the uni_susie.ipynb pipeline. If other inputs are used, please follow the instructions below.

- `molecular-pheno`, one PLINK trio (bed/bim/fam) per region, as produced by the second step of uni_susie.ipynb. For each region, at least two sets of molecular phenotypes are needed. For univariate analysis, please refer to the univariate sections of twas_fusion_susie.ipynb. Each PLINK trio shall be named `{name_prefix}.{region}.bed/bim/fam`, as shown in the following example.

```
geneTpmResidualsAgeGenderAdj_rename.ENSG00000196126.bed  
geneTpmResidualsAgeGenderAdj_rename.ENSG00000196126.bim  
geneTpmResidualsAgeGenderAdj_rename.ENSG00000196126.fam
```




- `--molecular-pheno-dir` A text file with a column named `#molc_pheno` that lists the paths to the output directories of twas_fusion_susie.ipynb, as shown in the following example.

```
#molc_pheno
./AC
./PCC
```
    If alternative inputs are used, all the molecular phenotypes, in the form of PLINK trios, shall be stored in a "cache" directory, and each path listed in this file shall point to the folder directly above that "cache" directory.

- `--region_list` An index text file with a `#region` column listing the `{region}` part of the file name of each of the aforementioned PLINK trios, as shown in the following example.

```
#region
ENSG00000196126
```

- `--name-prefix` The first part of the file name of each PLINK trio.

- `--cv_times` The number of cross-validation rounds to run.
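
The input layout described above can be checked before launching the workflow. The following is a minimal sketch in Python and is not part of the pipeline: the helper names `read_index` and `missing_plink_files` are hypothetical, and it assumes the `{directory}/cache/{name_prefix}.{region}.bed/bim/fam` layout that the merge step reads from. The demonstration builds a throwaway directory in which the `PCC` phenotype lacks its `.fam` file.

```python
import os
import tempfile

def read_index(path):
    """Return the first column of every non-empty, non-comment line."""
    with open(path) as handle:
        return [line.split()[0] for line in handle
                if line.strip() and not line.strip().startswith("#")]

def missing_plink_files(pheno_list, region_list, name_prefix):
    """Hypothetical helper: list the expected PLINK files that do not exist."""
    missing = []
    for pheno_dir in read_index(pheno_list):
        for region in read_index(region_list):
            stem = os.path.join(pheno_dir, "cache", f"{name_prefix}.{region}")
            for ext in ("bed", "bim", "fam"):
                if not os.path.exists(f"{stem}.{ext}"):
                    missing.append(f"{stem}.{ext}")
    return missing

# Demonstration: "AC" is complete, "PCC" has no .fam file.
with tempfile.TemporaryDirectory() as tmp:
    for tissue, exts in (("AC", ("bed", "bim", "fam")), ("PCC", ("bed", "bim"))):
        cache = os.path.join(tmp, tissue, "cache")
        os.makedirs(cache)
        for ext in exts:
            open(os.path.join(cache, f"demo.ENSG00000196126.{ext}"), "w").close()
    pheno_list = os.path.join(tmp, "molecular_phenotype_list")
    with open(pheno_list, "w") as handle:
        handle.write("#molc_pheno\n")
        handle.write(os.path.join(tmp, "AC") + "\n")
        handle.write(os.path.join(tmp, "PCC") + "\n")
    region_list = os.path.join(tmp, "region_list")
    with open(region_list, "w") as handle:
        handle.write("#region\nENSG00000196126\n")
    demo_missing = missing_plink_files(pheno_list, region_list, "demo")

print([os.path.basename(p) for p in demo_missing])
```

An empty result means every phenotype directory holds a complete trio for every region.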


## Output

- `.mv_cv.RData` An RData object containing all the SuSiE objects with added hsq/RMSE/R2/P-value metrics.

- `.mv.RData` An RData object containing all the SuSiE objects with added hsq, without cross-validation.

- `{name_prefix}.{region}.transformed_XY.RData` A collection of RData objects stored in the "result" folder under the working directory, storing the mean-imputed and scaled X as well as the scaled Y for each region.

- `.mv_wgt.txt` The weights computed for each gene, used to predict expression. They apply to the scaled X and Y.

- `.cv_diag.RData` A collection of RData objects stored in the "result" folder under the working directory, storing the cross-validation datasets and the results of each run.
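
To make the naming concrete, here is a small Python sketch that lists where these files are expected for a given run. It is an illustration only: `expected_outputs` is a hypothetical helper, and the file names are inferred from the `output:` statements and path strings in the workflow steps of this notebook rather than taken from any separate specification.

```python
def expected_outputs(wd, name_prefix, regions, cv=True):
    """Hypothetical helper: output paths inferred from the workflow steps."""
    paths = []
    for region in regions:
        stem = f"{wd}/result/{name_prefix}.{region}"
        paths.append(f"{stem}.mv_susie.model.RData")    # fitted model with hsq
        paths.append(f"{stem}.transformed_XY.RData")    # scaled X and Y
        paths.append(f"{stem}.mv_wgt.txt")              # weights
        if cv:
            paths.append(f"{stem}.mv_susie.model.cv.RData")  # model with CV metrics
    # Combined object written by the last step of mv_susie_cv or mv_susie
    paths.append(f"{wd}/mv_cv.RData" if cv else f"{wd}/mv.RData")
    return paths

outputs = expected_outputs(".", "geneTpmResidualsAgeGenderAdj_rename",
                           ["ENSG00000196126"])
for path in outputs:
    print(path)
```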

# Command interface (TBD)

In [1]:
!sos run mv_susie.ipynb -h

usage: sos run mv_susie.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  mv_susie
  mv_susie_cv

Global Workflow Options:
  --molecular-pheno-dir VAL (as path, required)
                        Path to a list of molecular phenotypes that are to be
                        analysised, shall contains a cache file within it.
  --region-list VAL (as path, required)
                        List of regions that are shared upon all three diretory
  --wd VAL (as path, required)
                        Path to the work directory of this pipeline,where the
                        output will be store

# Working example 
A minimal working example (MWE) dataset can be downloaded from the following link, which requires a Synapse account:
https://www.synapse.org/#!Synapse:syn24179065

To test the command, please download and decompress the mwe folder, copy this notebook into it, and run the following command within the mwe folder.

Alternatively, the options below can be changed to match your own relative paths.

Running this MWE should take around 5 minutes.

In [None]:
# Test the pipeline with MWE

sos run ./mv_susie.ipynb mv_susie_cv  \
--molecular_pheno_dir "molecular_phenotype_list"   \
--region_list region_list  \
--wd ./   \
--name_prefix "geneTpmResidualsAgeGenderAdj_rename" \
--container /mnt/mfs/statgen/containers/twas_latest.sif --impute TRUE \
--cv_times 2  &

# Test with prior
sos run ~/GIT/neuro-twas/Workflow/mv_susie.ipynb mv_susie_cv \
--molecular_pheno_dir "molecular_phenotype_list"   \
--region_list region_list  \
--wd ./   \
--name_prefix "geneTpmResidualsAgeGenderAdj_rename" \
--container /mnt/mfs/statgen/containers/twas_latest.sif --impute TRUE \
--cv_times 2  \
--mixture_prior '~/Project/Genome_prior/merge/output/geneTpmResidualsAgeGenderAdj_rename.Both.flash.FL_PC3.teem.UD_ED.rds'&



In [None]:
sos run ../../GIT/freshcopy/neuro-twas/Workflow/mv_susie.ipynb fusion_tf_cv \
--molecular_pheno_dir mole_pheno_ls   \
--region_list cand_rgs.txt  \
--wd ./   \
--name_prefix "geneTpmResidualsAgeGenderAdj_rename" \
--container gaow/twas --impute TRUE \
--cv_times 2  &

In [None]:
sos run ~/GIT/neuro-twas/Workflow/mv_susie.ipynb fusion_tf \
--molecular_pheno_dir "molecular_phenotype_list"   \
--region_list region_list  \
--wd ./   \
--name_prefix "geneTpmResidualsAgeGenderAdj_rename" \
--container /mnt/mfs/statgen/containers/twas_latest.sif --impute TRUE \
--cv_times 2  &

# Global parameter settings
This section outlines the parameters that can be set on the command line.
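
Besides declaring parameters, the `[global]` section reads the region list and the molecular phenotype list into Python lists. The sketch below reproduces that parsing on an in-memory file so the resulting structure is easy to see; the function name `parse_index` and the second region ID are made up for illustration.

```python
import io

def parse_index(handle):
    """Same list comprehension as in the [global] section, wrapped in a function."""
    return [x.strip().split() for x in handle.readlines()
            if x.strip() and not x.strip().startswith('#')]

# Header lines starting with '#' and blank lines are skipped.
region_text = "#region\nENSG00000196126\n\nEXAMPLE_REGION_2\n"
regions = parse_index(io.StringIO(region_text))

print(regions)        # [['ENSG00000196126'], ['EXAMPLE_REGION_2']]
print(regions[0][0])  # ENSG00000196126
```

Each entry is a list of whitespace-separated fields, which is why the later steps refer to the region ID as `_regions[0]`.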

In [5]:
[global]
# Path to a file listing the molecular phenotype directories to be analysed; each directory shall contain a cache folder.
parameter: molecular_pheno_dir = path
# Path to the work directory of this pipeline, where the output will be stored.
parameter: wd = path
# Path to store the output folder
parameter: output_path = f'{wd:a}/result'
# Specify the number of jobs per run.
parameter: job_size = 2
# Container option for software to run the analysis: docker or singularity
parameter: container = 'gaow/twas'
# List of regions shared by all the molecular phenotype directories; needs to have ID	CHR	P0	P1, in the same way as the twas_fusion_susie pipeline.
parameter: region_list = path
# name prefix of the molecular_pheno
parameter: name_prefix = "chr"
# Whether to impute the missing values
parameter: impute = "TRUE"
# Proportion of samples set aside for testing; set to zero if no CV is needed.
parameter: testing_prop = 0.2
# Number of cross-validation rounds (random training/testing splits)
parameter: cv_times = 100
# Prior: an RDS file holding a list with an element "U" containing the covariance structures and an element "w" containing the weights
# Preferably the output of wgs_prior_genome pipeline.
parameter: mixture_prior = "NULL"
# SNPs to be included in the final analysis
parameter: extract_snp = path

# Minor allele frequency threshold used to filter X
parameter: MAF = 0.01



# Get regions of interest to focus on.
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
molecular_pheno = [x.strip().split() for x in open(molecular_pheno_dir).readlines() if x.strip() and not x.strip().startswith('#')]

## Merge of X (plink) and Y (R)
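The phenotypes (Y) of all molecular phenotypes are joined by sample in R, which also writes a merge list. The genotypes (X) are then merged with PLINK according to that merge list.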
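
The phenotype merge is a full (outer) join on a sample name built as `FID:IID`, so a sample that is missing from one tissue is kept with a missing value. A minimal Python sketch of that join semantics follows; the function name `full_join` and the sample values are illustrative, and the workflow itself performs the join in R with dplyr.

```python
def full_join(tables):
    """Outer-join {sample_name: phenotype} dicts; absent samples get None."""
    samples = []
    for table in tables:
        for name in table:
            if name not in samples:
                samples.append(name)   # keep first-seen order
    return {name: [table.get(name) for table in tables] for name in samples}

# Two tissues sharing only one sample (made-up values)
tissue_a = {"F1:S1": 0.5, "F2:S2": -1.2}
tissue_b = {"F2:S2": 0.3, "F3:S3": 2.0}
merged = full_join([tissue_a, tissue_b])

for name, values in merged.items():
    print(name, values)
```

With three or more tissues the same function applies unchanged, which mirrors the loop that joins the remaining tissues one at a time.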

In [1]:
[snp_exclude_1,mv_susie_1,mv_susie_cv_1]
input:  molecular_pheno_dir, for_each = "regions"
output: f'{wd:a}/cache/{name_prefix}.{_regions[0]}.merged_list',
        f'{wd:a}/cache/{name_prefix}.{_regions[0]}.merged.bed',
        f'{wd:a}/cache/{name_prefix}.{_regions[0]}.merged.exp'
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '60G', tags = f'{step_name}_{_output[0]:bn}'  

R: expand = "$[ ]", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("plink2R")
    library("purrr")
    library("readr")
    molecular_pheno = read_delim("$[molecular_pheno_dir]",delim = "\t")
    molecular_pheno = molecular_pheno%>%mutate(dir = map_chr(`#molc_pheno`,~paste(c(`.x`,"/cache/$[name_prefix].$[_regions[0]]"),collapse = "")))
    n = nrow(molecular_pheno)
    # For every tissue, read the plink files and extract the fam data frame.
    genos = tibble( i = 1:n)
    genos = genos%>%mutate(fam = map(i, ~read_plink(molecular_pheno[[.x,2]])$fam%>%as_tibble()%>%mutate(name = paste(V1,":",V2,sep = ""))%>%select(name,V6)))
    
    # Join two tissues
    genos_join_phe_$[_regions[0]] = full_join((genos%>%pull(fam))[[1]],(genos%>%pull(fam))[[2]],by = "name")
    
    # If there are more tissues, join the rest
    if(n > 2){
    for(j in 3:n){
    genos_join_phe_$[_regions[0]] = full_join(genos_join_phe_$[_regions[0]],(genos%>%pull(fam))[[j]],by = "name")
    }
    }
    genos_join_phe_$[_regions[0]]%>%readr::write_delim("$[_output[2]]",delim = "\t")
    
    # Create merge list
    molecular_pheno[2]%>%readr::write_delim("$[_output[0]]",delim = "\t",col_names=FALSE)


bash: expand = "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout',container = container
    # create the merged output X
    plink --bfile '$[next(iter(molecular_pheno[0]))]/cache/$[name_prefix].$[_regions[0]]'\
          --merge-list $[_output[0]] \
          --mac 1 \
          --make-bed \
          --out $[_output[1]:n] \
          --allow-no-sex \
          --extract $[extract_snp]

## Exclude SNPs
This step keeps only the SNPs listed in the `extract_snp` file that have a minor allele count of at least 1, and drops all other SNVs from the merged genotype data.

In [None]:
[snp_exclude_2]
parameter: bed_list = path
input: group_by = 3, group_with = 'regions'
output: f'{wd:a}/cache_snp/{name_prefix}.{_regions[0]}.merged.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'  
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    # create the merged output X
    plink --bfile $[_input[1]:n] \
          --mac 1 \
          --make-bed \
          --out $[_output:n] \
          --allow-no-sex \
          --extract $[extract_snp]

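## Perform MV SuSiE
This step takes the merged files from the previous step and performs MV SuSiE. With `impute` set to TRUE, X is filtered (by missing rate and MAF), mean-imputed, and scaled, Y is scaled, and the covariance matrix of Y is computed via flashier. With `impute` set to FALSE, SNPs with missing genotypes are dropped instead and the sample covariance of the complete cases of Y is used.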
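
The filtering, mean imputation, and scaling of X can be sketched per SNP column as follows. This is a plain-Python illustration under stated assumptions, not the workflow's R code: genotypes are coded 0/1/2 with `None` for missing values, the function names are hypothetical, and the thresholds are passed in explicitly (the workflow uses a missing rate of 0.1 and the `MAF` parameter).

```python
import statistics

def missing_rate(col):
    """Fraction of missing genotypes in one SNP column."""
    return sum(v is None for v in col) / len(col)

def maf(col):
    """Minor allele frequency from 0/1/2 dosages, ignoring missing values."""
    observed = [v for v in col if v is not None]
    f = statistics.fmean(observed) / 2
    return min(f, 1 - f)

def keep_column(col, max_missing, min_maf):
    """Keep a SNP only if it passes the missing-rate, MAF, and variance checks."""
    distinct = {v for v in col if v is not None}
    return (missing_rate(col) <= max_missing
            and maf(col) >= min_maf
            and len(distinct) > 1)

def impute_and_scale(col):
    """Fill missing values with the column mean, then center and scale."""
    observed = [v for v in col if v is not None]
    mean = statistics.fmean(observed)
    filled = [mean if v is None else v for v in col]
    sd = statistics.stdev(filled)   # sample SD (n - 1), as in R's scale()
    return [(v - mean) / sd for v in filled]

snps = {
    "snp_ok":      [0, 1, 2, 1, 0, None, 1, 2, 0, 1],
    "snp_rare":    [0] * 10,                              # MAF of 0
    "snp_missing": [None, None, 1, 0, 1, 2, 0, 1, 1, 0],  # 20% missing
}
kept = [name for name, col in snps.items() if keep_column(col, 0.1, 0.01)]
scaled = impute_and_scale(snps["snp_ok"])
print(kept)
```

After scaling, the retained column has mean 0 and sample standard deviation 1.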

In [None]:
[mv_susie_2,mv_susie_cv_2,susie]
input: group_by = 3, group_with = 'regions'
output:  f'{wd:a}/result/{_input[0]:bn}.mv_susie.model.RData',
         f'{wd:a}/result/{_input[0]:bn}.transformed_XY.RData',
         f'{wd:a}/result/{_input[0]:bn}.mv_wgt.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("plink2R")
    library("mashr")
    library("mvsusieR")  
    library("flashier")
    library("modelr")
    # Define functions
    ###Functions to compute MAF and missing genotype rate
    compute_maf <- function(geno){
      f <- mean(geno,na.rm = TRUE)/2
      return(min(f, 1-f))
    }
    
    compute_missing <- function(geno){
      miss <- sum(is.na(geno))/length(geno)
      return(miss)
    }
    
    mean_impute <- function(geno){
      f <- apply(geno, 2, function(x) mean(x,na.rm = TRUE))
      for (i in 1:length(f)) geno[,i][which(is.na(geno[,i]))] <- f[i]
      return(geno)
    }
    
    is_zero_variance <- function(x) {
      if (length(unique(x%>%na.omit))==1) return(T)
      else return(F)
    }
    ### Filter X matrix
    filter_X <- function(X, missing_rate_thresh, maf_thresh) {
      rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, compute_maf) < maf_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, is_zero_variance))
      if (length(rm_col)) X <- X[, -rm_col]
      return(mean_impute(X))
    }
    ###Function to calculate the covariance matrix of Y via flash
    compute_cov_flash <- function(Y, miss=NULL){
      if(is.null(miss)){
        fl <- flashier::flash(Y, var.type = 2, prior.family = c(flashier::prior.normal(), flashier::prior.normal.scale.mix()), backfit = TRUE, verbose.lvl=0)
      } else {
        fl <- flashier::flash(Y[-miss, ], var.type = 2, prior.family = c(flashier::prior.normal(), flashier::prior.normal.scale.mix()), backfit = TRUE, verbose.lvl=0)
      }  
      if(fl$n.factors==0){
        covar <- diag(fl$residuals.sd^2)
      } else {
        fsd <- sapply(fl$fitted.g[[1]], '[[', "sd")
        covar <- diag(fl$residuals.sd^2) + crossprod(t(fl$flash.fit$EF[[2]]) * fsd)
      }
      return(covar)
    }
    ###Function to impute the missing X with means and then scale and center X
    impute_and_transform = function(genos, impute = TRUE){
      tmp = genos
      for(i in 1:ncol(tmp)){
        # Optionally replace missing values with the column mean
        if(impute == TRUE){
          tmp[,i] = coalesce(tmp[,i], mean(tmp[,i]%>%na.omit()))
        }
        # Center and scale the column
        tmp[,i] = tmp[,i]%>%scale()
      }
      return(tmp)
    }
    if ($[impute] == TRUE){
    # Load X data
    X_$[_regions[0]]_raw = read_plink("$[_input[1]:n]")$bed
    # Filter X by 0.1 NA and 0.01 MAF
    X_$[_regions[0]]_ftr = filter_X(X_$[_regions[0]]_raw,0.1,$[MAF])
    X_$[_regions[0]] = impute_and_transform(X_$[_regions[0]]_ftr)
    # Load Y data
    Y_$[_regions[0]] = read_delim("$[_input[2]]",delim = "\t")
    # Reorder Y based on X
    Y_$[_regions[0]] = Y_$[_regions[0]]%>%arrange(match(name,rownames(X_$[_regions[0]])))%>%select(-name)%>%as.matrix()
    # Compute the Cov matrix for Y via flashier
    Y_$[_regions[0]] = impute_and_transform(Y_$[_regions[0]], impute = FALSE)
    Y_$[_regions[0]]_cov = Y_$[_regions[0]]%>%compute_cov_flash()
    # Get prior
    prior_covar <- create_mash_prior(sample_data = list(X=X_$[_regions[0]],Y=Y_$[_regions[0]], residual_variance= Y_$[_regions[0]]_cov, max_mixture_len=-1))
    } else {
    # Load data
    Y_$[_regions[0]] = read_delim("$[_input[2]]",delim = "\t")
    # Remove NA from bed
    X_$[_regions[0]] = read_plink("$[_input[1]:n]")$bed%>%t()%>%na.omit()%>%t()%>%impute_and_transform(impute = FALSE)
    # Reorder Y based on X
    Y_$[_regions[0]] = Y_$[_regions[0]]%>%arrange(match(name,rownames(X_$[_regions[0]])))%>%select(-name)%>%as.matrix()
    # Scale Y
    Y_complete_$[_regions[0]] = Y_$[_regions[0]]%>%na.omit()
    Y_$[_regions[0]] = impute_and_transform(Y_$[_regions[0]], impute = FALSE)
    # Get prior
    # Compute the Cov matrix for Y_complete
    Y_$[_regions[0]]_cov = cov(Y_complete_$[_regions[0]])}
   
   
    if('$[mixture_prior]' == 'NULL'){
    prior_covar <- create_mash_prior(sample_data = list(X=X_$[_regions[0]],Y=Y_$[_regions[0]], residual_variance= Y_$[_regions[0]]_cov, max_mixture_len=-1,center=F,scale=F))
    } else {
    mx_prior = readRDS('$[mixture_prior]')
    prior_covar <- create_mash_prior(mixture_prior = list( weights = mx_prior$w, matrices = mx_prior$U))
    }
   
    m_$[_regions[0]] = mvsusie(X_$[_regions[0]], 
                Y_$[_regions[0]], 
                L=10, 
                prior_variance=prior_covar,
                residual_variance = Y_$[_regions[0]]_cov,
                precompute_covariances = TRUE)
 
    #Add a hsq element to the mvsusie object
    hsq_$[_regions[0]]=rep(0,ncol(Y_$[_regions[0]]))
    for (i in 1:ncol(Y_$[_regions[0]])){
      hsq_$[_regions[0]][i] = var(predict(m_$[_regions[0]])[,i])/var(Y_$[_regions[0]][,i]%>%na.omit())}
    m_$[_regions[0]]$hsq = hsq_$[_regions[0]]
    #Output: model with hsq estimated
    save(m_$[_regions[0]],file = "$[_output[0]]")
    #Output: scaled data
    scaled_$[_regions[0]] = list(X_$[_regions[0]],Y_$[_regions[0]])
    save(scaled_$[_regions[0]],file = "$[_output[1]]")
    #Output: Weight
    m_$[_regions[0]]$coef%>%as.data.frame()%>%write_delim("$[_output[2]]",delim = "\t")



## Perform cross-validation and store the relevant metrics
This step loads the scaled X and Y output from the previous step, performs CV, and calculates the diagnostic metrics: R2, P-value, and RMSE. The P-value here is the probability of observing an association at least as strong as the one seen, under the null hypothesis that there is no association between the predicted and actual Y.
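
The two simplest metrics, and the random train/test split they are computed on, can be sketched in plain Python. This is an illustration under assumptions: the function names are hypothetical, `monte_carlo_split` only approximates what modelr's `crossv_mc` does in the workflow, and the R2 shown is the squared Pearson correlation (the `r2_raw` variant), not the adjusted R2 from `lm`. The P-value is left out because it needs a t-distribution.

```python
import math
import random

def rmse(observed, predicted):
    """Root mean squared error, skipping samples with a missing observation."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o is not None]
    return math.sqrt(sum((p - o) ** 2 for o, p in pairs) / len(pairs))

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o is not None]
    n = len(pairs)
    mean_o = sum(o for o, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in pairs)
    sxx = sum((o - mean_o) ** 2 for o, _ in pairs)
    syy = sum((p - mean_p) ** 2 for _, p in pairs)
    return sxy ** 2 / (sxx * syy)

def monte_carlo_split(n, test_prop, rng):
    """Randomly assign sample indices to training and testing sets."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_prop)
    return indices[n_test:], indices[:n_test]

# Made-up held-out values: predictions are offset by 0.5, one observation is missing.
observed = [1.0, 2.0, None, 4.0]
predicted = [1.5, 2.5, 9.9, 4.5]
demo_rmse = rmse(observed, predicted)
demo_r2 = r_squared(observed, predicted)
train, test = monte_carlo_split(10, 0.2, random.Random(1))
print(demo_rmse, demo_r2, len(train), len(test))
```

The example also shows why both metrics are reported: a constant offset leaves the correlation perfect while the RMSE is clearly non-zero.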

In [None]:
[mv_susie_cv_3,cv]
input: group_by = 3, group_with = 'regions'
output:  f'{wd:a}/result/{_input[0]:bn}.cv.RData',
         f'{wd:a}/result/{_input[0]:bn}.cv_diag.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '30h',  mem = '60G', tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("plink2R")
    library("mashr")
    library("mvsusieR")  
    library("flashier")
    library("purrr")
    library("modelr")
    
    # Define functions
    compute_cov_flash <- function(Y, miss=NULL){
      if(is.null(miss)){
        fl <- flashier::flash(Y, var.type = 2, prior.family = c(flashier::prior.normal(), flashier::prior.normal.scale.mix()), backfit = TRUE, verbose.lvl=0)
      } else {
        fl <- flashier::flash(Y[-miss, ], var.type = 2, prior.family = c(flashier::prior.normal(), flashier::prior.normal.scale.mix()), backfit = TRUE, verbose.lvl=0)
      }  
      if(fl$n.factors==0){
        covar <- diag(fl$residuals.sd^2)
      } else {
        fsd <- sapply(fl$fitted.g[[1]], '[[', "sd")
        covar <- diag(fl$residuals.sd^2) + crossprod(t(fl$flash.fit$EF[[2]]) * fsd)
      }
      return(covar)
    }
    
    ## Compute rmse function
    compute_rmse = function(raw,fitted){
    rmse = rep(0,ncol(raw))
    for (i in 1:ncol(raw)){
      rmse[i] = ((fitted - raw)[,i])^2%>%mean(na.rm = TRUE)%>%sqrt() 
      }
    return(rmse)
    }
    
    ## Compute r2 function
    compute_r2 = function(raw,fitted){
      r2 = rep(0,ncol(raw))
      for (j in 1:ncol(raw)){
       r2[j] = summary(lm( fitted[,j] ~ raw[,j] ))$adj.r.sq
      }
      return(r2)
    }
    
    ## Compute r2 raw
    
    compute_r2_raw = function(raw,fitted){
      r2 = rep(0,ncol(raw))
      for (j in 1:ncol(raw)){
        r2[j] =  cor(fitted[,j][which(!is.na(raw[,j]))],raw[,j]%>%na.omit())^2
      }
      return(r2)
    }
    
    ## Get P.value
    compute_pval = function(raw,fitted){
      pval = rep(0,ncol(raw))
      for (k in 1:ncol(raw)){
        pval[k] = summary(lm( fitted[,k] ~ raw[,k] ))$coef[2,4]
      }
      return(pval)
    }
    
    

    ###Functions to compute MAF and missing genotype rate
    compute_maf <- function(geno){
      f <- mean(geno,na.rm = TRUE)/2
      return(min(f, 1-f))
    }
    
    compute_missing <- function(geno){
      miss <- sum(is.na(geno))/length(geno)
      return(miss)
    }
    
    mean_impute <- function(geno){
      f <- apply(geno, 2, function(x) mean(x,na.rm = TRUE))
      for (i in 1:length(f)) geno[,i][which(is.na(geno[,i]))] <- f[i]
      return(geno)
    }
    
    is_zero_variance <- function(x) {
      if (length(unique(x))==1) return(T)
      else return(F)
    }
    ### Filter X matrix
    filter_X <- function(X, missing_rate_thresh, maf_thresh) {
      rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, compute_maf) < maf_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, is_zero_variance))
      if (length(rm_col)) X <- X[, -rm_col]
      return(mean_impute(X))
    }
    
    ### Produce CV dataset
    cv_data_gen = function(X,Y,times,test_prop){
    # Merged the X and Y for producing testing and training set for modelr cv
    cv_df_raw = cbind(X,Y)%>%as_tibble() 
    cv_df = crossv_mc(cv_df_raw, times,test = test_prop)%>%mutate(
      train_X = map(train,~as_tibble(.x)[1:ncol(X)]%>%as.matrix),
      train_Y = map(train,~as_tibble(.x)[(ncol(X)+1):(ncol(X)+ncol(Y))]%>%as.matrix),
      test_X = map(test,~as_tibble(.x)[1:ncol(X)]%>%as.matrix),
      test_Y = map(test,~as_tibble(.x)[(ncol(X)+1):(ncol(X)+ncol(Y))]%>%as.matrix)
    )  
    
    # Filter Train X with maf and missing, filter test X with the same col as Train X
    cv_df = cv_df%>%mutate(
    train_X = map(train_X,~filter_X(.x,0.1,$[MAF])),
    test_X = map2(test_X,train_X,~.x%>%as_tibble()%>%select(colnames(.y))%>%as.matrix())
    )
    return(cv_df)
    }
    
    # Load data
    full_model = attach('$[_input[0]]')
    full_model = full_model$m_$[_regions[0]]
    X = attach('$[_input[1]]')$scaled_$[_regions[0]][[1]]
    Y = attach('$[_input[1]]')$scaled_$[_regions[0]][[2]]
    # Generate the CV dataset
    
    
    cv_df = cv_data_gen(X,Y,$[cv_times],$[testing_prop])
    
                                                      
    # Compute the cov matrix for training set Y based on the choice of imputation                  
                         
    if ($[impute] == TRUE){
        cv_df = cv_df%>%mutate( 
        cov = map(train_Y,~.x%>%compute_cov_flash())
      )
        }else{
      cv_df = cv_df%>%mutate(
        cov = map(train_Y,~cov(.x%>%na.omit)))}

    # Actual cv
    
    if('$[mixture_prior]'=='NULL'){

    
    cv_df = cv_df%>%mutate(
    
    ## Get the prior
    
        prior = pmap(list(train_X,train_Y,cov),function(first,second,third)(
        create_mash_prior(sample_data = list( X = first, Y = second, residual_variance = third, max_mixture_len =-1,center =F,scale =F)
        ))) ,
        
    ## Do mvsusie
    
      mvsusie = pmap(list(train_X,train_Y,cov,prior),function(first,second,third,forth)(
        mvsusie(first,second, L=10, prior_variance = forth,residual_variance = third,precompute_covariances = TRUE)
        ))
        
        )} else {
   
       mx_prior = readRDS('$[mixture_prior]')
    ## Get the prior
       prior_covar <- create_mash_prior(mixture_prior = list( weights = mx_prior$w, matrices = mx_prior$U))
       
       cv_df = cv_df%>%mutate(
        
    ## Do mvsusie
    
      mvsusie = pmap(list(train_X,train_Y,cov),function(first,second,third)(
        mvsusie(first,second, L=10, prior_variance = prior_covar,residual_variance = third,precompute_covariances = TRUE)
        )))
    }
  
    ## Shrinkage of weights
    
  
  
  
  
    # Extract data 
    
    cv_df = cv_df%>%mutate(
      weight = map(mvsusie,~.x$coef),
      test_fitted = map2(mvsusie,test_X,~predict.mvsusie(.x,.y)),
      rmse = map2(test_Y,test_fitted,~compute_rmse(.x,.y)),
      r2 = map2(test_Y,test_fitted,~compute_r2(.x,.y)),
      r2_raw = map2(test_Y,test_fitted,~compute_r2_raw(.x,.y)),
      pval = map2(test_Y,test_fitted,~compute_pval(.x,.y))
    )
    
    # Calculate metrics
    
    mean_rmse = cv_df%>%pull(rmse)%>%as.data.frame()%>%t()%>%as_tibble()%>%na.omit()%>%colMeans()
    mean_r2 = cv_df%>%pull(r2)%>%as.data.frame()%>%t()%>%as_tibble()%>%na.omit()%>%colMeans()
    mean_r2_raw = cv_df%>%pull(r2_raw)%>%as.data.frame()%>%t()%>%as_tibble()%>%na.omit()%>%colMeans()
    mean_pval = cv_df%>%pull(pval)%>%as.data.frame()%>%t()%>%as_tibble()%>%na.omit()%>%colMeans()

  
    # Save metrics
    full_model$rmse = mean_rmse
    full_model$r2 = mean_r2 
    full_model$r2_raw = mean_r2_raw    
    full_model$pval = mean_pval
    full_model$snps = colnames(X)

    # Save the CV data
    save(cv_df,file = "$[_output[1]]")
    
    #Output
    save(full_model,file = "$[_output[0]]")
    
   
                    
                    
                    
                    

## Merging all the RData files
This step merges the per-region output from the previous step into a single RData file.

In [225]:
[mv_susie_3]
input: group_by = "all"
output:  f'{wd:a}/mv.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', tags = f'{step_name}_{_output:bn}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load a template
    region = read_delim("$[region_list]",delim ="\t")%>%select(ID = `#region` )
    # get the path
    dir = "$[_input[0]:d]/"
    pre = "$[name_prefix]"
    sur = ".mv_susie.model.RData"
    region = region%>%mutate(path = map(ID, ~paste(collapse = "", c(dir,pre,".",.x,sur))))
    # Load the data
    output = region%>%mutate(env = map(path,~attach(.x)),
                            tb_name = map_chr(ID,~paste(collapse = "_", c("m",.x))),
                             model = map2(env,tb_name , ~get(.y,env = .x)))
    # Save the combined output
    save(output,file = "$[_output]")

In [None]:
[mv_susie_cv_4]
input: group_by = "all"
output:  f'{wd:a}/mv_cv.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', tags = f'{step_name}_{_output:bn}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load a template
    region = read_delim("$[region_list]",delim ="\t")%>%select(ID = `#region` )
    # get the path
    dir = "$[_input[0]:d]/"
    pre = "$[name_prefix]"
    sur = ".mv_susie.model.cv.RData"
    region = region%>%mutate(path = map(ID, ~paste(collapse = "", c(dir,pre,".",.x,sur))))
    # Load the data
    output = region%>%mutate(env = map(path,~attach(.x)),
                            tb_name = "full_model",
                             model = map2(env,tb_name , ~get(.y,env = .x)))
    # Save the combined output
    save(output,file = "$[_output]")

## Fusion transformed
These steps separate the output from the previous step into one wgt file per tissue, which can be used as input to the FUSION association testing pipeline.
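
For each tissue, the last step of this section also writes a weight list with the columns `WGT`, `ID`, `CHR`, `P0`, and `P1`, where `WGT` points to the per-region wgt file. The Python sketch below shows how those rows are put together; `weight_list_rows` is a hypothetical helper, the region ID and coordinates are placeholders, and it assumes the region list carries `chr`, `start_position`, and `end_position` columns as the R code expects.

```python
def weight_list_rows(wd, tissue, name_prefix, regions, suffix=".mv.wgt.RDat"):
    """Build the rows of a per-tissue weight list (header row first)."""
    rows = [["WGT", "ID", "CHR", "P0", "P1"]]
    for region_id, chrom, start, end in regions:
        wgt = f"{wd}/wgt/{tissue}/{name_prefix}.{region_id}{suffix}"
        rows.append([wgt, region_id, str(chrom), str(start), str(end)])
    return rows

# Placeholder region: (ID, chr, start_position, end_position)
demo_regions = [("geneA", 1, 100, 200)]
rows = weight_list_rows("/work", "AC", "demo", demo_regions)
for row in rows:
    print("\t".join(row))
```

Passing `suffix=".mv.cv.wgt.RDat"` gives the file names used by the cross-validated variant of the step.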

In [1]:
# Create wgt.RDat file
[fusion_tf_cv_1]
input:  molecular_pheno_dir,for_each = 'regions'
output: dynamic(f'{wd:a}/wgt/*/{name_prefix}.{_regions[0]}.mv.cv.wgt.RDat')
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', tags = f'{step_name}_{_index}'
R: expand= "$[ ]", stderr = f'{_input[0]}.split.stderr', stdout = f'{_input[0]}.split.stdout',container = container
    library("dplyr")
    library("plink2R")
    library("tibble")
    library("readr")
    library("purrr")
    #Load the input
    genos = read_plink('$[wd:a]/cache/$[name_prefix].$[_regions[0]].merged')
    molecular_pheno = read_delim('$[_input[0]]',delim = '\t')
    load('$[wd:a]/result/$[name_prefix].$[_regions[0]].mv_susie.model.cv.RData')
    #Get all the components
    pval = full_model$pval
    rsq = full_model$r2
    
    hsq.pv = NA
    N.tot = nrow(genos$bed)
    
    cv.performance_tol = rbind(
    rsq = rsq, 
    pval = pval
    )
    ## Keep only the SNPs in the bim that are also in the model, for consistency
    snps = genos$bim%>%filter(V2 %in%full_model$snps )
    
    # Create output for each tissue separately

    for(i in 1:nrow(molecular_pheno) ){
    wgt.matrix = full_model$coef[2:nrow(full_model$coef),i]%>%as.matrix()
    hsq = full_model$hsq[i]
    dir = "$[wd:a]/wgt/"
    tis = read.table(text = molecular_pheno[[i,1]], sep = "/", as.is = TRUE)
    tis = tis[[length(tis)]]
    sur = "/$[name_prefix].$[_regions[0]].mv.cv.wgt.RDat"
    out = paste(collapse = "",c(dir,tis,sur))
    cv.performance = cv.performance_tol[,i]%>%as.matrix()
    colnames(cv.performance) = "mv_susie"
    # make the folders
    cmd0 = paste(c("mkdir ",dir),collapse = "")
    system(cmd0,ignore.stdout=TRUE,ignore.stderr=TRUE)
    cmd1 = paste(c("mkdir ",dir,tis),collapse = "")
    system(cmd1,ignore.stdout=TRUE,ignore.stderr=TRUE)
    # save the files
    save(
    wgt.matrix,
    snps,
    cv.performance,
    hsq, hsq.pv, N.tot,
    file = out
    )    }
    

In [None]:
# Create wgt.RDat file
[fusion_tf_1]
input:  molecular_pheno_dir,for_each = 'regions'
output: dynamic(f'{wd:a}/wgt/*/{name_prefix}.{_regions[0]}.mv.wgt.RDat')
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', tags = f'{step_name}_{_index}'
R: expand= "$[ ]", stderr = f'{_input[0]}.split.stderr', stdout = f'{_input[0]}.split.stdout',container = container
    library("dplyr")
    library("plink2R")
    library("tibble")
    library("readr")
    library("purrr")
    #Load the input
    genos = read_plink('$[wd:a]/cache/$[name_prefix].$[_regions[0]].merged')
    molecular_pheno = read_delim('$[_input[0]]',delim = '\t')
    full_model = attach('$[wd:a]/result/$[name_prefix].$[_regions[0]].mv_susie.model.RData')$m_$[_regions[0]]
    X_dat = attach('$[wd:a]/result/$[name_prefix].$[_regions[0]].transformed_XY.RData')$scaled_$[_regions[0]]
    X_snps = colnames(X_dat[[1]])
    #Get all the components
    pval = NA
    rsq = NA
    hsq.pv = NA
    N.tot = nrow(genos$bed)
    cv.performance_tol = rbind(
    rsq = rsq, 
    pval = pval
    )
    ## Keep only the SNPs in the bim that are also in the scaled X, for consistency
    snps = genos$bim%>%filter(V2 %in% X_snps )
    
    # Create output for each tissue separately

    for(i in 1:nrow(molecular_pheno) ){
    wgt.matrix = full_model$coef[2:nrow(full_model$coef),i]%>%as.matrix()
    hsq = full_model$hsq[i]
    dir = "$[wd:a]/wgt/"
    tis = read.table(text = molecular_pheno[[i,1]], sep = "/", as.is = TRUE)
    tis = tis[[length(tis)]]
    sur = "/$[name_prefix].$[_regions[0]].mv.wgt.RDat"
    out = paste(collapse = "",c(dir,tis,sur))
    cv.performance = cv.performance_tol
    colnames(cv.performance) = "mv_susie"
    # make the folders
    cmd0 = paste(c("mkdir ",dir),collapse = "")
    system(cmd0,ignore.stdout=TRUE,ignore.stderr=TRUE)
    cmd1 = paste(c("mkdir ",dir,tis),collapse = "")
    system(cmd1,ignore.stdout=TRUE,ignore.stderr=TRUE)
    # save the files
    save(
    wgt.matrix,
    snps,
    cv.performance,
    hsq, hsq.pv, N.tot,
    file = out
    )    }

In [None]:
[fusion_tf_cv_2]
molecular_pheno = [x.strip().split("/") for x in open(molecular_pheno_dir).readlines() if x.strip() and not x.strip().startswith('#')]
input: region_list, for_each = "molecular_pheno"
output: f'{wd:a}/wgt/{_molecular_pheno[1]}/All_wgt_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', tags = f'{step_name}_{_index}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load the template
    temp = read_delim('$[_input]',delim = "\t")
    # Create name
    dir = "$[wd:a]/wgt/"
    tis = "$[_molecular_pheno[1]]"
    pre = "/$[name_prefix]."
    sur = ".mv.cv.wgt.RDat"
    res = temp%>%mutate(
    WGT = map_chr(`#region`,~paste(collapse = "", c(dir,tis,pre,.x,sur)))
    )%>%select(WGT,ID = `#region`,CHR = chr, P0 = start_position, P1 = end_position)
    res%>%write_delim("$[_output]",delim = "\t")

In [None]:
[fusion_tf_2]
molecular_pheno = [x.strip().split("/") for x in open(molecular_pheno_dir).readlines() if x.strip() and not x.strip().startswith('#')]
input: region_list, for_each = "molecular_pheno"
output: f'{wd:a}/wgt/{_molecular_pheno[1]}/All_wgt_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', tags = f'{step_name}_{_index}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load the template
    temp = read_delim('$[_input]',delim = "\t")
    # Create name
    dir = "$[wd:a]/wgt/"
    tis = "$[_molecular_pheno[1]]"
    pre = "/$[name_prefix]."
    sur = ".mv.wgt.RDat"
    res = temp%>%mutate(
    WGT = map_chr(`#region`,~paste(collapse = "", c(dir,tis,pre,.x,sur)))
    )%>%select(WGT,ID = `#region`,CHR = chr, P0 = start_position, P1 = end_position)
    res%>%write_delim("$[_output]",delim = "\t")