# Susie whole genome sequencing
A ready-to-use pipeline to estimate univariate associations between a molecular phenotype and genotypes via SuSiE.

## Pre-requisites

We provide a container image `docker://gaow/twas` that contains all the software needed to run the pipeline. If you prefer to configure the environment yourself, make sure the following software is installed before running this notebook.

Bash:
- GCTA
- PLINK

R:
- tidyverse
- susieR
- modelr
- abind
- plink2R
- data.table

# Input and Output
## Input

- `--genotype_list`: An index text file with two columns: the chromosome and the corresponding PLINK bed file.
- `--molecular-pheno`: A text file containing the molecular phenotype table. It should have regions (genes) as rows and samples as columns.
- `--region_list`: A text file with 4 columns specifying #Chr, P0 (start position), P1 (end position), and the names of the regions to analyze. The column names do not matter, but the column order does. The name of the first column must start with a `#`. The region list can be generated with another SoS pipeline, SOS_ROSMAP_gene_exp_processing.ipynb.
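
As a rough illustration of the region list layout described above, the following Python sketch parses such a file and skips the `#` header line, in the same spirit as the `regions` list built in the global section of this notebook. The function name `read_region_list` and the gene names are hypothetical and chosen only for this example.

```python
import io

def read_region_list(handle):
    """Return (chrom, start, end, name) tuples, skipping blank and '#' lines."""
    regions = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # the header line starts with '#', so it is ignored
        chrom, start, end, name = line.split()[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions

# Toy region list; only the column order matters, not the header names
example = io.StringIO("#Chr\tP0\tP1\tID\n1\t1000\t2000\tGENE_A\n2\t500\t900\tGENE_B\n")
regions = read_region_list(example)
print(regions)
```

Because the header is detected only by its leading `#`, a header line without it would be read as a region, which is why the first column name must start with `#`.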

## Output

- `uni_weight.rds`: an RDS file that serves as the input for the mixture pipeline.
- `susie.RData`: an R object containing the susie output for each of the regions.
 

# Command interface 

In [2]:
!sos run twas_fusion.ipynb -h

usage: sos run twas_fusion.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  twas_fusion
  association_test

Global Workflow Options:
  --molecular-pheno VAL (as path, required)
                        Path to the input molecular phenotype data.
  --gwas-sumstat VAL (as path, required)
                        Path to GWAS summary statistics data (association
                        results between SNP and disease in a GWAS)
  --genotype-list VAL (as path, required)
                        An index text file with two columns of chromosome and
                        the corresponding PLINK bed file.
  --region-list VAL (as path, required)
                        An index text file 

# Working example
The MWE files are available at:
"https://www.synapse.org/#!Synapse:syn24179064/files/"

Running this MWE should take around 2 minutes. Pay extra attention to the gene_start and gene_end positions when using the following command on gene_exp files other than this MWE. Also, when too few or too many genes pass the heritability check, consider increasing or decreasing the `--window` option, respectively.

In [1]:
## Test pipeline with test data
## Switch back to absolute paths; otherwise there will be a file-not-found error in step 5
sos run susie-wgs-prior.ipynb susie \
  --molecular-pheno ./molecular_phenotype \
  --wd ./ \
  --genotype_list ./geno_dir \
  --region_list ./region_list \
  --region_name 1 \
  --data_start 5 \
  --window 500000 \
  --container /mnt/mfs/statgen/containers/twas_latest.sif 


ERROR: Failed to locate twas_fusion.ipynb.sos



# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [5]:
[global]
# Path to the input molecular phenotype data.
parameter: molecular_pheno = path
# An index text file with two columns of chromosome and the corresponding PLINK bed file.
parameter: genotype_list = path
# An index text file with 4 columns specifying the chr, start, end and names of regions to analyze
parameter: region_list = path
# Path to the work directory of the weight computation: output weights and cache will be saved to this directory.
parameter: wd = path('./')
# Specify the directory to save fitted weights
parameter: weights_path = f'{wd:a}/WEIGHTS'
# Path to list of weights
parameter: weights_list = f'{weights_path}/{molecular_pheno:bn}.weights_list.txt'
# Path to store the output folder
parameter: output_path = f'{wd:a}/result'
# Specify the column in the molecular_pheno file that contains the name of the region of interest
parameter: region_name = int
# Specify the column in the molecular_pheno file where the actual data start
parameter: data_start = int
# Specify the scanning window, i.e. the upstream and downstream radius to analyze around the region of interest, in units of bp (the value is passed directly to PLINK --from-bp/--to-bp)
parameter: window = 50000
# Specify the number of jobs per run.
parameter: job_size = 2

# Container option for software to run the analysis: docker or singularity
parameter: container = 'gaow/twas'

# Proportion of samples assigned to the testing set; set to zero if no CV is needed.
parameter: testing_prop = 0.2
# Number of random training/testing partitions generated for cross-validation
parameter: cv_times = 2

# Minor allele frequency threshold used to filter X
parameter: MAF = 0.01



# parameters for the susie pipelines.
parameter: causal_variables_L = 10
parameter: scaled_prior_variance = 0.1

# Get regions of interest to focus on.
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]

geno_inventory = dict([x.strip().split() for x in open(genotype_list).readlines() if x.strip() and not x.strip().startswith('#')])

import os
def get_genotype_file(chrom, genotype_list, geno_inventory):
    chrom = f'{chrom}'
    if chrom.startswith('chr'):
        chrom = chrom[3:]
    if chrom not in geno_inventory:
        geno_file = f'{chrom}'
    else:
        geno_file = geno_inventory[chrom]
    if not os.path.isfile(geno_file):
        # relative path
        if not os.path.isfile(f'{genotype_list:ad}/' + geno_file):
            raise ValueError(f"Cannot find genotype file {geno_file}")
        else:
            geno_file = f'{genotype_list:ad}/' + geno_file
    return path(geno_file)

## Partition of the molecular phenotype for each gene
This step extracts the molecular phenotype for each gene and transposes it into the format needed for the follow-up analysis.
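
To make the transposition concrete, here is a minimal Python sketch of the reshaping this step performs for one gene: the sample columns (from the 1-based `data_start` column onward) become one row per sample, with the sample name repeated in the first two columns and the phenotype value in the third. The helper name `gene_to_pheno_rows` and the toy data are assumptions for illustration; the pipeline itself does this in R.

```python
def gene_to_pheno_rows(header, row, data_start):
    """Transpose one gene's row into (sample, sample, value) rows.

    data_start is the 1-based index of the first sample column.
    """
    samples = header[data_start - 1:]
    values = row[data_start - 1:]
    return [(s, s, v) for s, v in zip(samples, values)]

# Toy phenotype table: four annotation columns followed by three samples
header = ["gene_id", "chr", "start", "end", "S1", "S2", "S3"]
row = ["GENE_A", "1", "1000", "2000", "0.5", "-1.2", "NA"]
pheno = gene_to_pheno_rows(header, row, data_start=5)
for fid, iid, value in pheno:
    print(fid, iid, value)
```

Repeating the sample name gives the family ID / individual ID / value layout that PLINK expects for a phenotype file.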

In [11]:
[hsq_1,susie_1,susie_cv_1]
input: molecular_pheno, for_each = "regions"
output: f'{wd:a}/cache/{_input:bn}.{_regions[3]}.exp',
        f'{wd:a}/cache/{_input:bn}.{_regions[3]}.pheno'
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    # Get the line number for the region in the file
    line_num = system("awk '($$[region_name]==\"$[_regions[3]]\") {print NR}' $[_input]", intern=T)
    if (length(line_num) == 0){
      stop( "Cannot find $[_regions[3]] in column $[region_name]  $[_input]")}
    yi <- data.table::fread(file = $[_input:r], skip = as.integer(line_num) - 1, nrows = 1)
    samplenames_yi <- data.table::fread(file = $[_input:r], skip = 0, nrows = 1)
    colnames(yi) <- colnames(samplenames_yi)
    readr::write_tsv(yi, path = "$[_output[0]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
    yi <- as.data.frame(yi[, $[data_start]:ncol(yi), drop = FALSE])
    yj <- rbind(colnames(yi),colnames(yi),yi)
    readr::write_tsv(as.data.frame(t(yj)), path = "$[_output[1]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")

## Construction of Plink trio for each gene
This step constructs the PLINK file set (bed/bim/fam) for each gene based on the output of the previous step. Specifically, it:

1. Selects only the SNPs within the scanning window around the position of the corresponding region (gene)
2. Replaces the phenotype value (last column) of the .fam file based on the input
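
The two operations above can be sketched in plain Python as follows. This is only an illustration of the logic, not what PLINK does internally: it assumes the standard six-column `.bim` and `.fam` layouts, and the interval bounds mirror the `--from-bp`/`--to-bp` arguments of the PLINK call in this step (clamping the lower bound at 0 is an addition of this sketch). The function names and the toy records are hypothetical.

```python
def snps_in_interval(bim_lines, chrom, lo, hi):
    """Return IDs of SNPs on `chrom` whose position lies in [lo, hi]."""
    keep = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        snp_chrom, snp_id, bp = fields[0], fields[1], int(fields[3])
        if snp_chrom == str(chrom) and lo <= bp <= hi:
            keep.append(snp_id)
    return keep

def set_fam_phenotype(fam_line, value):
    """Replace the sixth (phenotype) column of a .fam record."""
    fields = fam_line.split()
    fields[5] = str(value)
    return " ".join(fields)

# Toy inputs: a region starting at 1000 bp with a 500 bp window
start, window = 1000, 500
lo, hi = max(0, start - window), start + window
bim = ["1 rs1 0 400 A G", "1 rs2 0 1200 C T", "1 rs3 0 2600 G A", "2 rs4 0 1200 T C"]
selected = snps_in_interval(bim, "1", lo, hi)
new_fam = set_fam_phenotype("F1 I1 0 0 1 -9", 0.5)
print(selected, new_fam)
```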


In [225]:
[hsq_2,susie_2,susie_cv_2]
input: group_by = 2, group_with = 'regions'
output: f'{_input[0]:n}.bed',
        f'{_input[0]:n}.bim',
        f'{_input[0]:n}.fam'

# Look up the genotype file for the chromosome of this region (first column of the region list)
geno_file = get_genotype_file(_regions[0],genotype_list,geno_inventory)

parameter: extract_snp = f'{geno_file:an}.bim'
parameter: exclude_snp = "./."
parameter: keep_sample = f'{_input[1]}'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, volumes = [f'{geno_file:ad}:{geno_file:ad}']
    ##### Get the locus genotypes for $[_regions[3]]
    plink --bfile $[geno_file:an] \
    --pheno $[_input[1]] \
    --make-bed \
    --out $[_output[0]:n] \
    --chr $[_regions[0]] \
    --from-bp $[int(_regions[1]) - window ] \
    --to-bp $[int(_regions[1]) + window ] \
    --keep $[keep_sample] \
    --extract $[extract_snp] \
    --exclude $[exclude_snp] \
    --allow-no-sex || true
    touch $[_output]

## Heritability Estimation
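
The cell below runs GCTA REML on each gene and skips genes whose heritability estimate is negative or whose LRT P-value exceeds a threshold. A minimal Python sketch of that decision rule follows, assuming the `.hsq` file consists of whitespace-separated rows labeled `V(G)/Vp` and `Pval`. The function names, the toy numbers, and the 0.01 threshold are assumptions for illustration, not values taken from the pipeline.

```python
def parse_hsq(lines):
    """Extract (h2, se, pval) from the rows of a GCTA-style .hsq file."""
    h2 = se = pval = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "V(G)/Vp":
            h2, se = float(fields[1]), float(fields[2])
        elif fields[0] == "Pval":
            pval = float(fields[1])
    return h2, se, pval

def passes_heritability_check(h2, pval, hsq_p=0.01):
    """Keep a gene only if h2 is non-negative and the LRT P-value is small enough."""
    if h2 is None or pval is None:
        return False
    return h2 >= 0 and pval <= hsq_p

# Made-up .hsq content for illustration only
toy_hsq = [
    "Source\tVariance\tSE",
    "V(G)\t0.39\t0.16",
    "V(e)\t0.58\t0.16",
    "Vp\t0.97\t0.03",
    "V(G)/Vp\t0.40\t0.16",
    "LRT\t7.09",
    "Pval\t0.004",
]
h2, se, pval = parse_hsq(toy_hsq)
keep_gene = passes_heritability_check(h2, pval)
print(h2, se, pval, keep_gene)
```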

In [None]:
[hsq_3]
input: group_by = 3, group_with = 'regions'
output: f'{_input[0]:n}.bed',
        f'{_input[0]:n}.bim',
        f'{_input[0]:n}.fam'

import os
skip_if(os.path.getsize(f'{_input[0]}') == 0)
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = "20G" , tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container,volumes = [f'{wd:a}:{wd:a}']
    library("dplyr")
    library("tibble")
    library("susieR")
    library("plink2R")
    library("readr")
    library("modelr")
    library("purrr")
    library("abind")



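    # ---
    # Set up the "input" first, so that the helpers and checks below can use it
    opt = list()
    opt$bfile = "$[_input[0]:n]"
    opt$tmp = "$[_input[0]:n].tmp"
    opt$out = "$[_input[0]:n]"
    opt$PATH_plink = "plink"
    opt$PATH_gcta = "gcta64"
    # The settings below are referenced later in this step but were never defined;
    # the values are assumed defaults, adjust them as needed
    opt$covar = NA          # no covariate file
    opt$hsq_p = 0.01        # LRT P-value threshold for the heritability check
    opt$verbose = 1
    opt$save_hsq = FALSE
    opt$noclean = FALSE
    # Suppress PLINK/GCTA console output unless verbose >= 2
    SYS_PRINT = opt$verbose < 2
    # Remove temporary files (assumed helper; it was called but not defined in this notebook)
    cleanup = function() {
        if ( !opt$noclean ) system(paste("rm -f ", opt$tmp, "*", sep=''))
    }

    # Perform i/o checks here:
    files = c($[_input[0]:r],$[_input[1]:r],$[_input[2]:r])

    for ( f in files ) {
        if ( !file.exists(f) ){
            cat( "ERROR: ", f , " input file does not exist\n" , sep='', file=stderr() )
            cleanup()
            q()
        }
    }

    if ( system( paste(opt$PATH_plink,"--help") , ignore.stdout=T,ignore.stderr=T ) != 0 ) {
        cat( "ERROR: plink could not be executed\n" , sep='', file=stderr() )
        cleanup()
        q()
    }

    if ( system( opt$PATH_gcta , ignore.stdout=T,ignore.stderr=T ) != 0 ){
        cat( "ERROR: gcta (gcta64) could not be executed\n" , sep='', file=stderr() )
        cleanup()
        q()
    }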
    
    # ---

    fam = read.table(paste(opt$bfile,".fam",sep=''),as.is=T)

    # Make/fetch the phenotype file
    
        pheno.file = paste(opt$tmp,".pheno",sep='')
        pheno = fam[,c(1,2,6)]
        write.table(pheno,quote=F,row.names=F,col.names=F,file=pheno.file)
   
    geno.file = opt$tmp
    # recode to the intersection of samples and new phenotype
    arg = paste( opt$PATH_plink ," --allow-no-sex --bfile ",opt$bfile," --pheno ",pheno.file," --keep ",pheno.file," --make-bed --out ",geno.file,sep='')
    system(arg , ignore.stdout=SYS_PRINT,ignore.stderr=SYS_PRINT)

    # --- HERITABILITY ANALYSIS
    
    # 1. generate GRM
    arg = paste( opt$PATH_plink," --allow-no-sex --bfile ",opt$tmp," --make-grm-bin --out ",opt$tmp,sep='')
    system(arg , ignore.stdout=SYS_PRINT,ignore.stderr=SYS_PRINT)

    # 2. estimate heritability
    if ( !is.na(opt$covar) ) {
    arg = paste( opt$PATH_gcta ," --grm ",opt$tmp," --pheno ",raw.pheno.file," --qcovar ",opt$covar," --out ",opt$tmp," --reml --reml-no-constrain --reml-lrt 1",sep='')
    } else {
    arg = paste( opt$PATH_gcta ," --grm ",opt$tmp," --pheno ",pheno.file," --out ",opt$tmp," --reml --reml-no-constrain --reml-lrt 1",sep='')
    }
    system(arg , ignore.stdout=SYS_PRINT,ignore.stderr=SYS_PRINT)

    # 3. evaluate LRT and V(G)/Vp
    if ( !file.exists( paste(opt$tmp,".hsq",sep='') ) ) {
        cat(opt$tmp,"does not exist, likely GCTA could not converge, skipping gene\n",file=stderr())
        cleanup()
        q()
    }

    hsq.file = read.table(file=paste(opt$tmp,".hsq",sep=''),as.is=T,fill=T)
    hsq = as.numeric(unlist(hsq.file[hsq.file[,1] == "V(G)/Vp",2:3]))
    hsq.pv = as.numeric(unlist(hsq.file[hsq.file[,1] == "Pval",2]))

    if ( opt$verbose >= 1 ) cat("Heritability (se):",hsq,"LRT P-value:",hsq.pv,'\n')
    if ( opt$save_hsq ) cat( opt$out , hsq , hsq.pv , '\n' , file=paste(opt$out,".hsq",sep='') )

    # 4. stop if insufficient
    if ( hsq[1] < 0 || hsq.pv > opt$hsq_p ) {
        cat(opt$tmp," : heritability ",hsq[1],"; LRT P-value ",hsq.pv," : skipping gene\n",sep='',file=stderr())
        cleanup()
        q()
    }

## Conduct univariate tests for all genes and save the summary statistics
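
Before fitting SuSiE, the cell below filters variants by minor allele frequency and computes per-variant summary statistics with a univariate regression. As a rough sketch of those two per-variant quantities, the following Python code computes a minor allele frequency from 0/1/2 dosages and the slope and standard error of a simple linear regression with an intercept. It is a textbook least-squares illustration, not susieR's implementation, and the toy data are made up.

```python
import math

def compute_maf(geno):
    """Minor allele frequency from 0/1/2 dosages; None marks a missing call."""
    observed = [g for g in geno if g is not None]
    f = sum(observed) / len(observed) / 2
    return min(f, 1 - f)

def univariate_regression(x, y):
    """Slope and its standard error for y ~ x, with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    bhat = sxy / sxx
    intercept = y_bar - bhat * x_bar
    rss = sum((yi - intercept - bhat * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # residual variance uses n - 2 degrees of freedom
    return bhat, se

maf = compute_maf([0, 1, 2, None, 0])
bhat, se = univariate_regression([0, 1, 2], [1, 3, 4])
print(maf, bhat, se)
```

In the pipeline these statistics are computed for every column of the filtered genotype matrix and saved as `bhat` and `sbhat`.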

In [None]:
#Susie test
[susie_3,susie_cv_3]
input:  group_by = 3, group_with = 'regions'
output: f'{wd:a}/susie/{_input[0]:bn}.susie.model.RData',
        f'{wd:a}/susie/{_input[0]:bn}.uni_weight.rds'

import os
skip_if(os.path.getsize(f'{_input[0]}') == 0)
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = "60G" , tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container,volumes = [f'{wd:a}:{wd:a}']
    library("dplyr")
    library("tibble")
    library("susieR")
    library("plink2R")
    library("readr")
    library("modelr")
    library("purrr")
    library("abind")

    # Define functions
    ###Functions to compute MAF and missing genotype rate
    compute_maf <- function(geno){
      f <- mean(geno,na.rm = TRUE)/2
      return(min(f, 1-f))
    }
    
    compute_missing <- function(geno){
      miss <- sum(is.na(geno))/length(geno)
      return(miss)
    }
    
    mean_impute <- function(geno){
      f <- apply(geno, 2, function(x) mean(x,na.rm = TRUE))
      for (i in 1:length(f)) geno[,i][which(is.na(geno[,i]))] <- f[i]
      return(geno)
    }
    
    is_zero_variance <- function(x) {
      if (length(unique(x%>%na.omit()))==1) return(T)
      else return(F)
    }
    ### Filter X matrix
    filter_X <- function(X, missing_rate_thresh, maf_thresh) {
      rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, compute_maf) < maf_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, is_zero_variance))
      if (length(rm_col)) X <- X[, -rm_col]
      return(mean_impute(X))}
      
    ### Function to optionally mean-impute the missing X, then center and scale each column
    impute_and_transform = function(genos, impute = TRUE){
      tmp = genos
      for (i in 1:ncol(tmp)) {
        if (impute == TRUE) {
          tmp[,i] = coalesce(tmp[,i], mean(tmp[,i]%>%na.omit()))%>%scale()
        } else {
          tmp[,i] = tmp[,i]%>%scale()
        }
      }
      return(tmp)
    }
        
    ### Function to compute the univariate effect estimates and their standard errors

    mm_regression = function(X, Y, Z = NULL, center = TRUE, scale = TRUE) {
      ## HS: Make sure X and Y are matrices as well; otherwise ncol(Y) gives an error
      X = as.matrix(X)
      Y = as.matrix(Y)
      if (!is.null(Z)) {
        Z = as.matrix(Z)
      }
      if (any(is.na(Y))) {
        reg = lapply(seq_len(ncol(Y)), function (i) simplify2array(susieR:::univariate_regression(X, Y[,i], Z, center, scale)))
        reg = do.call(abind, c(reg, list(along=0)))
        # return array: out[1,,] is betahat, out[2,,] is shat
        out = aperm(reg, c(3,2,1))
        # HS: Force dimension for the matrix slice.
        out = list(bhat = as.matrix(out[1,,]), sbhat = as.matrix(out[2,,]))
      } else {
        # Pass center/scale through instead of hard-coding them to FALSE
        out = susieR:::univariate_regression(X, Y, Z, center = center, scale = scale)
        # HS: The original output is named betahat/sebetahat rather than bhat/sbhat; renamed here.
        out$bhat = as.matrix(out$betahat)
        out$sbhat = as.matrix(out$sebetahat)
      }
      if (!is.null(colnames(X))) {
        rownames(out$bhat) = colnames(X)
        rownames(out$sbhat) = colnames(X)
      }
      if (!is.null(colnames(Y))) {
        colnames(out$bhat) = colnames(Y)
        colnames(out$sbhat) = colnames(Y)
      }
      # `out` is a list of bhat and sbhat
      return(out)
    }

    # Load data and transform
    genos = read_plink("$[_input[0]:n]")
  
    # Filter X by a missing rate of 0.1 and by the MAF threshold, and then transform X
    
    X_ftr = filter_X(genos$bed,0.1,$[MAF])
    X = impute_and_transform(X_ftr,impute = FALSE)
    
    Y = genos$fam%>%as_tibble()%>%mutate(name = paste(V1,":",V2,sep = ""))
    
    # Make sure X and Y have the same order
    Y = Y%>%arrange(match(name,rownames(X)))%>%select(V6)
    
    
    # Center and scale Y 
    
    Y = impute_and_transform(Y, impute = FALSE )
    Y = Y$V6
    
    # Susie with full samples
    full_model = susie(X, Y,
                  L = $[causal_variables_L],
                  estimate_residual_variance = TRUE, 
                  estimate_prior_variance = FALSE,
                  scaled_prior_variance = $[scaled_prior_variance])
                  

    # Get sumstat from the univariate regression
    
    uni = mm_regression(X,Y,center=FALSE,scale=FALSE)
    
    # Save the sumstat objects
    
    uni%>%saveRDS("$[_output[1]]")
    
    # Save the model for future use
    full_model$X = X
    full_model$Y = Y
    save(full_model, file="$[_output[0]]")

## Cross-validation
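
The cell below draws `cv_times` random train/test partitions, holds out a `testing_prop` fraction of the samples in each, fits SuSiE on the training part, and scores the predictions on the held-out part. The partitioning and the RMSE metric can be sketched in Python as follows. This is a simplified stand-in for the `crossv_mc` call used in the R code, so the exact number of held-out samples may differ; the function names and the `seed` argument are assumptions of this sketch.

```python
import math
import random

def mc_cv_splits(n_samples, times, test_prop, seed=0):
    """Draw `times` random (train, test) index partitions (Monte Carlo CV)."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_prop))
    splits = []
    for _ in range(times):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # The first n_test shuffled indices are held out, the rest are used for training
        splits.append((sorted(idx[n_test:]), sorted(idx[:n_test])))
    return splits

def rmse(raw, fitted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((f - r) ** 2 for r, f in zip(raw, fitted)) / len(raw))

splits = mc_cv_splits(n_samples=10, times=2, test_prop=0.2)
for train, test in splits:
    print(len(train), len(test))
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Unlike k-fold cross-validation, the held-out sets of different partitions are drawn independently and may overlap.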

In [None]:
# CV with univariate susie
[susie_cv_4]
input: group_by = 2, group_with = 'regions'
output:  f'{wd:a}/susie/{_input[0]:bn}.cv.RData',
         f'{wd:a}/susie/{_input[0]:bn}.cv_diag.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    library("modelr")
    library("susieR")
    
    # Define functions
   
    ## Compute rmse function
    compute_rmse = function(raw,fitted){
    rmse = rep(0,ncol(raw))
    for (i in 1:ncol(raw)){
      rmse[i] = ((fitted - raw)[,i])^2%>%mean(na.rm = TRUE)%>%sqrt() 
      }
    return(rmse)
    }
    
    ## Compute r2 function
    compute_r2 = function(raw,fitted){
      r2 = rep(0,ncol(raw))
      for (j in 1:ncol(raw)){
       r2[j] = summary(lm( as.matrix(fitted[,j]) ~ as.matrix(raw[,j]) ))$adj.r.sq
      }
      return(r2)
    }
    
    ## Compute r2 raw
    
    compute_r2_raw = function(raw,fitted){
      r2 = rep(0,ncol(raw))
      for (j in 1:ncol(raw)){
        r2[j] =  cor(as.matrix(fitted[,j])[which(!is.na(raw[,j]))],raw[,j]%>%na.omit())^2
      }
      return(r2)
    }
    
    ## Get P.value
    compute_pval = function(raw,fitted){
      pval = rep(0,ncol(raw))
      for (k in 1:ncol(raw)){
        pval[k] = summary(lm( fitted[,k]%>%as.matrix ~ raw[,k]%>%as.matrix ))$coef[2,4]
      }
      return(pval)
    }
    
    

    ###Functions to compute MAF and missing genotype rate
    compute_maf <- function(geno){
      f <- mean(geno,na.rm = TRUE)/2
      return(min(f, 1-f))
    }
    
    compute_missing <- function(geno){
      miss <- sum(is.na(geno))/length(geno)
      return(miss)
    }
    
    mean_impute <- function(geno){
      f <- apply(geno, 2, function(x) mean(x,na.rm = TRUE))
      for (i in 1:length(f)) geno[,i][which(is.na(geno[,i]))] <- f[i]
      return(geno)
    }
    
    is_zero_variance <- function(x) {
      if (length(unique(x))==1) return(T)
      else return(F)
    }
    ### Filter X matrix
    filter_X <- function(X, missing_rate_thresh, maf_thresh) {
      rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, compute_maf) < maf_thresh)
      if (length(rm_col)) X <- X[, -rm_col]
      rm_col <- which(apply(X, 2, is_zero_variance))
      if (length(rm_col)) X <- X[, -rm_col]
      return(mean_impute(X))
    }
    
    ### Produce CV dataset
    cv_data_gen = function(X,Y,times,test_prop){
    # Merged the X and Y for producing testing and training set for modelr cv
    cv_df_raw = cbind(X,Y)%>%as_tibble() 
    cv_df = crossv_mc(cv_df_raw, times ,test = test_prop)%>%mutate(
      train_X = map(train,~as_tibble(.x)[1:ncol(X)]%>%as.matrix),
      train_Y = map(train,~as_tibble(.x)[(ncol(X)+1):(ncol(X)+ncol(Y))]%>%as.matrix),
      test_X = map(test,~as_tibble(.x)[1:ncol(X)]%>%as.matrix),
      test_Y = map(test,~as_tibble(.x)[(ncol(X)+1):(ncol(X)+ncol(Y))]%>%as_tibble)
    )  
    
    # Filter Train X with maf and missing, filter test X with the same col as Train X
    cv_df = cv_df%>%mutate(
    train_X = map(train_X,~filter_X(.x,0.1,$[MAF])),
    test_X = map2(test_X,train_X,~.x%>%as_tibble()%>%select(colnames(.y))%>%as.matrix())
    )
    return(cv_df)
    }
    
    # Load Data
    full_model = attach('$[_input[0]]')$full_model
    X = full_model$X
    Y = full_model$Y%>%as_tibble()

    # Create cv dataset
        
    cv_df = cv_data_gen(X,Y,$[cv_times],$[testing_prop])

    # Actual cv
    
    cv_df = cv_df%>%mutate(
    
   
    ## Do susie
    
      susie = pmap(list(train_X,train_Y),function(first,second)(
        
        susie(first, second,
        L = $[causal_variables_L],
        estimate_residual_variance = TRUE, 
        estimate_prior_variance = FALSE,
        scaled_prior_variance = $[scaled_prior_variance])
        )))
    
    # Extract data 
    
    cv_df = cv_df%>%mutate(
      weight = map(susie,~
      (coef(.x)[2:length(coef(.x))])
      ),
      test_fitted = map2(susie,test_X,~predict(.x,.y)%>%as_tibble),
      rmse = map2(test_Y,test_fitted,~compute_rmse(.x,.y)),
      r2 = map2(test_Y,test_fitted,~compute_r2(.x,.y)),
      r2_raw = map2(test_Y,test_fitted,~compute_r2_raw(.x,.y)),
      pval = map2(test_Y,test_fitted,~compute_pval(.x,.y))
    )
    
    # Calculate metrics
    
    mean_rmse = cv_df%>%pull(rmse)%>%as.data.frame()%>%t()%>%as_tibble()%>%colMeans()
    mean_r2 = cv_df%>%pull(r2)%>%as.data.frame()%>%t()%>%as_tibble()%>%colMeans()
    mean_r2_raw = cv_df%>%pull(r2_raw)%>%as.data.frame()%>%t()%>%as_tibble()%>%colMeans()
    mean_pval = cv_df%>%pull(pval)%>%as.data.frame()%>%t()%>%as_tibble()%>%colMeans()

  
    # Save metrics
    full_model$rmse = mean_rmse
    full_model$r2 = mean_r2 
    full_model$r2_raw = mean_r2_raw    
    full_model$pval = mean_pval    
    fitted1 = full_model
    # Save the CV data
    save(cv_df,file = "$[_output[1]]")
    
    #Output
    save(fitted1,file = "$[_output[0]]")
    

## Save output
This step creates the all_hsq.txt file to summarize the SuSiE results and creates an R object to hold all the susieR models.
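
The R code in the cells below locates each per-region model file by pasting together a directory, a file prefix, the region ID, and a fixed suffix. A small Python sketch of that naming scheme is shown here; the helper name `model_path` and the example directory, prefix, and region IDs are hypothetical.

```python
import os

def model_path(model_dir, prefix, region_id, suffix=".susie.model.RData"):
    """Build the path of the per-region model file: <dir>/<prefix>.<region><suffix>."""
    return os.path.join(model_dir, f"{prefix}.{region_id}{suffix}")

# Collect one path per region ID, as the condensing step does before loading the models
region_ids = ["GENE_A", "GENE_B"]
paths = [model_path("susie", "molecular_phenotype", rid) for rid in region_ids]
print(paths)
```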

In [None]:
# Save the selected susie objects into one RData file
[condense]
input: group_by = "all"
output:f'{wd:a}/susie/all_hsq.txt',
       f'{wd:a}/susie.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'

R: expand= "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load a template
    region = read_delim("$[_output[0]]",delim ="\t")%>%select(path = file, ID = region )
    # get the path
    dir = "$[_input[0]:d]/"
    pre = "$[_input[0]:bnnnnn]"
    sur = ".susie.model.RData"
    region = region%>%mutate(path = map(ID, ~paste(collapse = "", c(dir,pre,".",.x,sur))))
    # Load the data
    # The .susie.model.RData files store the fitted object under the name `full_model`
    output = region%>%mutate(env = map(path,~attach(.x)),
                             model = map(env, ~.x$full_model))
    # Save the combined output
    save(output,file = "$[_output[1]]")

In [None]:
# Create all_hsq for susie and save all the susie objects into one RData file
[susie_4]
input: group_by = "all"
output:f'{wd:a}/susie/all_hsq.txt',
       f'{wd:a}/susie.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]"
  head -1 $[_input[0]] > $[_output[0]]
  cat $[wd:a]/susie/*.hsq | grep -v hsq_full_sample | uniq >> $[_output[0]]

R: expand= "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load a template
    region = read_delim("$[_output[0]]",delim ="\t")%>%select(path = file, ID = region )
    # get the path
    dir = "$[_input[0]:d]/"
    pre = "$[_input[0]:bnnnnn]"
    sur = ".susie.model.RData"
    region = region%>%mutate(path = map(ID, ~paste(collapse = "", c(dir,pre,".",.x,sur))))
    # Load the data
    # The .susie.model.RData files store the fitted object under the name `full_model`
    output = region%>%mutate(env = map(path,~attach(.x)),
                             model = map(env, ~.x$full_model))
    # Save the combined output
    save(output,file = "$[_output[1]]")

In [None]:
# Create all_hsq for the susie CV results and save all the susie objects into one RData file
[susie_cv_5]
input: group_by = "all"
output:f'{wd:a}/susie/all_hsq.txt',
       f'{wd:a}/susie_cv.RData'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]"

  head -1 $[_input[0]:nnn].hsq > $[_output[0]]
  cat $[wd:a]/susie/*.hsq | grep -v hsq_full_sample | uniq >> $[_output[0]]

R: expand= "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout',container = container
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    # Load a template
    region = read_delim("$[_output[0]]",delim ="\t")%>%select(path = file, ID = region )
    # get the path
    dir = "$[_input[0]:d]/"
    pre = "$[_input[0]:bnnnnn]"
    sur = ".susie.model.cv.RData"
    region = region%>%mutate(path = map(ID, ~paste(collapse = "", c(dir,pre,".",.x,sur))))
    # Load the data
    output = region%>%mutate(env = map(path,~attach(.x)),
                             model = map(env, ~.x$fitted1))
    # Save the combined output
    save(output,file = "$[_output[1]]")