# PTWAS Implementation in R


This module contains the software implementation for transcriptome-wide association analysis (TWAS). The methods are designed to perform rigorous causal inference connecting genes to complex traits. The statistical models and the key algorithms are described in the [manuscript](https://www.biorxiv.org/content/10.1101/808295v1).
**The original PTWAS uses the output from DAP-G and is written in C++. Here we use SuSiE objects instead and port the code to R.**

### NOTE 1
Double-check the following code:
```
susie_df <- data.frame(
                gene_id = gene_name,
                variant = row.names(as.data.frame(susie_susie_obj$pip)),
                weight = round(susie_susie_obj$mu[length(susie_susie_obj$sets$cs),], 4), 
                tissue =  "${tissue}")
```

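1. Should the weights come from `mu` or from the posterior mean coefficients (`coef`)?
2. Should the row of `mu` be selected by the total number of credible sets (`length(sets$cs)`, as above) or by the index of the corresponding credible set?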
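To make the first question concrete, here is a minimal sketch on toy numbers (not real SuSiE output) contrasting the two candidate weight definitions: one row of `mu`, versus the posterior mean obtained by summing `alpha * mu` over the single effects. The function names and values are hypothetical. As far as we understand, susieR's `coef()` additionally rescales by the stored column scale factors, which is omitted here.

```python
# Toy SuSiE-like posterior summaries: 2 single effects (rows), 3 variants (columns).
# alpha[l][j]: posterior probability that variant j is the causal variant of effect l.
# mu[l][j]: posterior mean effect of variant j, conditional on being causal in effect l.
alpha = [[0.90, 0.05, 0.05],
         [0.10, 0.80, 0.10]]
mu = [[0.50, 0.20, -0.10],
      [0.30, -0.40, 0.05]]


def single_row_weights(mu, row):
    """Weights taken from one row of mu, as in the snippet above."""
    return list(mu[row])


def posterior_mean_weights(alpha, mu):
    """Sum over effects of alpha * mu: the unconditional posterior mean per variant."""
    n_variants = len(mu[0])
    return [sum(a[j] * m[j] for a, m in zip(alpha, mu)) for j in range(n_variants)]


row_w = single_row_weights(mu, row=1)
pm_w = posterior_mean_weights(alpha, mu)
print("one row of mu:  ", row_w)
print("posterior mean: ", [round(w, 4) for w in pm_w])
```

The two vectors can differ substantially: a row of `mu` gives every variant a non-zero conditional effect, whereas the posterior mean shrinks variants with low inclusion probability toward zero.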

## Overview

The goal of this module is to perform PTWAS analysis from SuSiE objects, including:
1. Extract weights from eQTL SuSiE objects.
2. Convert GWAS sumstats to a format with z-scores.
3. Run PTWAS in R.


### Input
1. QTL SuSiE table:
    - This table has two columns, `molecular_trait_id` and `susie_file`: the target gene and the corresponding SuSiE output RDS file, respectively.
2. GWAS sumstats results (tsv format)
3. region list
4. LD reference (used only by the LD-aware `ptwas_scan` workflow)

### Output
1. SuSiE weight table
2. re-formatted GWAS sumstats results
3. PTWAS results

# Example
We now run an example of this using the vcf file generated from the sample of susie eQTLs.

In [None]:
cd /mnt/vast/hpc/csg/rf2872/Work/INTACT/ptwas_test

sos run pipeline/ptwas.ipynb susie_weight \
    --susie-table eqtl_susie_table_head30.txt  \
    --tissue DLPFC

In [None]:
 sos run pipeline/ptwas.ipynb gwas_ptwas_prep \
    --ptwas_weights output/DLPFC.eQTL.susie_reformat.txt \
    --gwas_basepath /mnt/vast/hpc/csg/ftp_lisanwanglab_sync/ftp_fgc_xqtl/projects/ADGWAS_Bellenguez_2022/ 


In [None]:
sos run pipeline/ptwas.ipynb ptwas_scan_withoutLD \
    --ptwas_weights output/DLPFC.eQTL.susie_reformat.txt \
    --gwas_path output/ADGWAS_Bellenguez_2022.gambitgwas.txt \
    --region_list /mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/dlpfc_region_list  --tissue DLPFC

In [None]:
[global]
# Workdir
parameter: cwd = path("output")
# susie_table is the table of eQTL fine-mapping results, with two columns: the gene and its susie_file
parameter: susie_table = ""
# dataset 
parameter: tissue = ''
# QTL data type
parameter: QTL = 'eQTL'
parameter: container = ''
parameter: entrypoint={('micromamba run -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else f''}
parameter: job_size = 1
parameter: walltime = "5h"
parameter: mem = "8G"
parameter: numThreads = 1


## SuSiE weight 

The existing pipeline uses `mu` from the SuSiE object.
If we instead use the posterior mean (`coef`), we can skip this part and use the output from `susie_post_processing`.

In [None]:
[susie_weight_1]
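import os
import pandas as pd
# The SuSiE table is tab-delimited; the R block below reads each chunk with sep = "\t"
df = pd.read_csv(susie_table, sep='\t')
os.makedirs(f'{cwd}/cache', exist_ok=True)
file_paths = []
# Split the table into chunks of 100 rows so that each chunk becomes one job
for i, group in df.groupby(df.index // 100):
    output_file_path = f'{cwd}/cache/{tissue}.{QTL}.chunk_{i}.csv'
    group.to_csv(output_file_path, index=False, sep='\t')
    file_paths.append(output_file_path)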
input: file_paths, group_by = 1
output: f'{cwd}/cache/{_input:bn}.susie_reformat.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    library(susieR)
    library(stringr)
    library(tidyverse)
    
    susie_tbl = read.csv("${_input}", sep = "\t")
    susie_files = susie_tbl$susie_file
    genes = susie_tbl$molecular_trait_id
    
    vcf_out = data.frame(chr=NULL, pos=NULL, var_id=NULL, ref = NULL, alt = NULL, info=NULL)
    cumu_susie_r_df <- data.frame(matrix(nrow = 0, ncol = 4))
    for(i in seq(1, length(genes))) {
        susie_susie_obj <- readRDS(susie_files[i])$dlpfc_eqtl # FIXME: the element name is hard-coded for the DLPFC eQTL data
        if ( length(susie_susie_obj$sets$cs)) {
            
            susie_df <- data.frame(
                gene_id = genes[i],
                variant = row.names(as.data.frame(susie_susie_obj$pip)),
                weight = round(susie_susie_obj$mu[length(susie_susie_obj$sets$cs),], 4), 
                tissue =  "${tissue}")
            if (nrow(cumu_susie_r_df) < 1) cumu_susie_r_df <- susie_df else cumu_susie_r_df <- rbind(cumu_susie_r_df, susie_df)
        }
    }
    saveRDS(cumu_susie_r_df,"${_output}")

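The chunking used by `susie_weight_1` (grouping rows by `df.index // 100`) can be illustrated without pandas. The sketch below is a plain-Python equivalent on a hypothetical table of 250 rows; the gene and file names are made up for illustration.

```python
def chunk_rows(rows, chunk_size=100):
    """Group rows by integer division of their position, like df.index // chunk_size."""
    chunks = {}
    for position, row in enumerate(rows):
        chunks.setdefault(position // chunk_size, []).append(row)
    return chunks


# Hypothetical table of 250 (gene, susie_file) pairs.
rows = [(f"gene_{i}", f"gene_{i}.susie.rds") for i in range(250)]
chunks = chunk_rows(rows, chunk_size=100)
print({chunk_id: len(members) for chunk_id, members in chunks.items()})
```

Each chunk becomes one input file and therefore one job, so the chunk size trades off the number of jobs against the run time of each job.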
In [None]:

[susie_weight_2]
input: group_by = 'all'
output: f'{cwd}/{_input[0]:bnnn}.susie_reformat.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    library(susieR)
    library(stringr)
    library(tidyverse)
    cumu_susie_r_df  <- NULL
    all.list <- stringr::str_split("${_input}", " ", simplify = T)
    for (i in all.list) {
        tmp_cumu_susie_r_df <- readRDS(i)
        # Skip empty chunks and stack the remaining per-chunk weight tables
        if (!is.null(tmp_cumu_susie_r_df)) {
            if (is.null(cumu_susie_r_df)) {
                cumu_susie_r_df <- tmp_cumu_susie_r_df
            } else {
                cumu_susie_r_df <- rbind(cumu_susie_r_df, tmp_cumu_susie_r_df)
            }
        }
    }
    formatted_r_data <- cumu_susie_r_df %>%
        filter(abs(weight) > 0) %>%
        distinct(gene_id, variant, tissue, .keep_all = TRUE) %>%
        mutate(
            CHR = sapply(strsplit(variant, ":"), "[[", 1),
            chrnum = as.numeric(gsub("chr", "", CHR)),
            pos = as.numeric(sapply(strsplit(variant, "[:|_]"), "[[", 2))) %>%
        arrange(chrnum, pos) %>%
        select(-chrnum, -CHR, -pos)
    formatted_r_data$weight <- format(formatted_r_data$weight, scientific = FALSE)
    #write.table(cumu_susie_r_df,"${_output}", sep ="\t", quote = F, row.names = F, col.names = F)
    write_delim(formatted_r_data, "${_output}", delim = "\t", append = FALSE, col_names = TRUE, quote = "none")

**output_1**: susie weight table

In [6]:
head output/DLPFC.eQTL.susie_reformat.txt


gene_id	variant	weight	tissue
ENSG00000001460	chr1:24080045_G_A	 0.0641	DLPFC
ENSG00000001461	chr1:24080045_G_A	-0.0159	DLPFC
ENSG00000001460	chr1:24080157_G_A	-0.0335	DLPFC
ENSG00000001461	chr1:24080157_G_A	 0.0243	DLPFC
ENSG00000001460	chr1:24080563_C_T	-0.0210	DLPFC
ENSG00000001461	chr1:24080563_C_T	-0.0038	DLPFC
ENSG00000001460	chr1:24080644_C_T	-0.0335	DLPFC
ENSG00000001461	chr1:24080644_C_T	 0.0243	DLPFC
ENSG00000001460	chr1:24080863_G_A	-0.0357	DLPFC

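The variant IDs in this table follow the pattern `chr:pos_ref_alt`, and `susie_weight_2` sorts rows by chromosome number and position. A minimal sketch of that parsing and ordering is given below; the helper names are hypothetical, and the sketch assumes autosomes only (a chromosome such as `chrX` would need separate handling).

```python
import re


def parse_variant(variant_id):
    """Split an ID such as 'chr1:24080045_G_A' into (chrom, pos, ref, alt)."""
    chrom, pos, ref, alt = re.split(r"[:_]", variant_id)
    return chrom, int(pos), ref, alt


def sort_key(variant_id):
    """Order by numeric chromosome, then by position (autosomes only)."""
    chrom, pos, _, _ = parse_variant(variant_id)
    return int(chrom.replace("chr", "")), pos


variants = ["chr2:100_A_G", "chr1:24080157_G_A", "chr1:24080045_G_A"]
ordered = sorted(variants, key=sort_key)
print(ordered)
```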

## GWAS data preparation



In [None]:
[gwas_ptwas_prep]
parameter: ptwas_weights=""
parameter: gwas_basepath=""
input: ptwas_weights, gwas_basepath
output: f'{cwd}/{_input[1]:bn}.gambitgwas.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    library(tidyverse)
    target_variants <- read_delim(
        "${ptwas_weights}",
        delim = "\t",
        show_col_types = FALSE) %>%
        mutate(id = gsub(":", "_", variant)) %>%
        pull(id) %>%
        unique()

    gwas_basepath <- "${gwas_basepath}"
    gwas <- data.frame(matrix(ncol=12, nrow=0))

    # full.names = TRUE returns complete paths, so gwas_basepath does not need a trailing slash
    for (gwasfile in list.files(gwas_basepath, full.names = TRUE)) {
        gwaschr <- read_delim(
            gwasfile,
            delim = "\t",
            show_col_types = FALSE) %>%
            filter(variant %in% target_variants)
        gwas <- if (nrow(gwas) < 1) gwaschr else rbind(gwas, gwaschr)
    }

    #gwas$chromosome <- paste0('chr', gwas$chromosome)
    gwas$ZSCORE <- gwas$beta/gwas$se
    gwas$N <- gwas$n_cases + gwas$n_controls
    gwas$SNP_ID <- gwas$variant
    gambitgwas <- gwas %>% subset(select=c("chromosome", "position", "ref", "alt", "SNP_ID", "N", "ZSCORE"))
    colnames(gambitgwas) <- c("#CHR", "POS", "REF", "ALT", "SNP_ID", "N", "ZSCORE")

    write_delim(
        gambitgwas,
        "${_output}",
        delim = "\t",
        append = FALSE,
        col_names = TRUE,
        quote = "none")

**output_2**: re-formatted GWAS sumstats results



In [7]:
head output/ADGWAS_Bellenguez_2022.gambitgwas.txt


#CHR	POS	REF	ALT	SNP_ID	N	ZSCORE
1	24080045	G	A	chr1_24080045_G_A	487511	-2.264705882352941
1	24080157	G	A	chr1_24080157_G_A	487511	2.1585365853658534
1	24080563	C	T	chr1_24080563_C_T	487511	2.012048192771084
1	24080644	C	T	chr1_24080644_C_T	487511	2.1463414634146343
1	24080863	G	A	chr1_24080863_G_A	487511	2.207317073170732
1	24080867	G	A	chr1_24080867_G_A	487511	2.2560975609756095
1	24081747	A	G	chr1_24081747_A_G	487511	2.170731707317073
1	24081924	C	T	chr1_24081924_C_T	487511	2.170731707317073
1	24082275	T	C	chr1_24082275_T_C	487511	2.1585365853658534

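The conversion performed by `gwas_ptwas_prep` amounts to two derived columns: `ZSCORE = beta / se` and `N = n_cases + n_controls`. The sketch below applies it to a single record. The input column names are those used in the step above; the numeric values are invented for illustration and are not taken from the GWAS.

```python
def to_gambit_row(record):
    """Convert one GWAS sumstats record into the z-score format used downstream."""
    return {
        "#CHR": record["chromosome"],
        "POS": record["position"],
        "REF": record["ref"],
        "ALT": record["alt"],
        "SNP_ID": record["variant"],
        "N": record["n_cases"] + record["n_controls"],
        "ZSCORE": record["beta"] / record["se"],
    }


# Hypothetical record; the values are made up.
record = {
    "chromosome": 1, "position": 12345, "ref": "G", "alt": "A",
    "variant": "chr1_12345_G_A",
    "beta": 0.05, "se": 0.02, "n_cases": 1000, "n_controls": 3000,
}
row = to_gambit_row(record)
print(row)
```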

## PTWAS scan
This portion contains code for running the PTWAS scan as implemented in GAMBIT. 
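
The test computed by the `burden` function in the steps of this section can be written compactly. The notation is introduced here for illustration only: $w$ is the vector of allele-aligned eQTL weights for one gene, $z$ is the vector of GWAS z-scores for the same variants, and $R$ is the LD correlation matrix estimated from the reference panel.

$$
T = \frac{w^\top z}{\sqrt{w^\top R\, w}}, \qquad p = \Pr\left(\chi^2_1 > T^2\right)
$$

In the `ptwas_scan_withoutLD` workflow no LD matrix is used, which corresponds to $R = I$, so the denominator reduces to $\sqrt{w^\top w}$. When a gene has a single variant, the code sets $T$ to that variant's z-score, with the sign flipped if the weight is negative.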



### Input

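- eQTL weights
    Tab-delimited file produced by `susie_weight`, with columns `gene_id`, `variant`, `weight` and `tissue` (see **output_1**).
- GWAS z-scores
    Tab-delimited file produced by `gwas_ptwas_prep`, with columns `#CHR`, `POS`, `REF`, `ALT`, `SNP_ID`, `N` and `ZSCORE` (see **output_2**).
- LD reference
    Per-chromosome VCF files; only required by the LD-aware `ptwas_scan` workflow.
- region list
    Tab-delimited file with five columns: chromosome, start, end, gene ID and gene name.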


### Output

Same output as GAMBIT

In [None]:
[ptwas_scan_1]
parameter: ptwas_weights=""
parameter: gwas_path=""
parameter: region_list=""
input: ptwas_weights, gwas_path, region_list
output: f'{cwd}/cache/input_dataframe.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    library(tidyverse)
    library(harmonicmeanp)
    library(VariantAnnotation, verbose = FALSE)
    library(snpStats, verbose = FALSE)
    handle_weights <- function(gwas_ids, weight_ids) {
        return(unlist(lapply(
            strsplit(gwas_ids, '[:|_]'),
            function(x) {
                POSs  <- as.double(sapply(strsplit(weight_ids, '[:|_]'), '[[', 2))
                REFs <- sapply(strsplit(weight_ids, '[:|_]'), '[[', 3)
                ALTs <- sapply(strsplit(weight_ids, '[:|_]'), '[[', 4)
                for (i in 1:length(REFs)) {
                    if (as.double(x[2]) == POSs[i]) {
                        if (x[3] == REFs[i] & x[4] == ALTs[i]) {
                            return(1)
                        } else if (x[3] == ALTs[i] & x[4] == REFs[i]) {
                            return(-1)
                        } else if (x[3] == aflip(REFs[i]) & x[4] == aflip(ALTs[i])) {
                            return(1)
                        } else if (x[3] == aflip(ALTs[i]) & x[4] == aflip(REFs[i])) {
                            return(-1)
                        } else {
                            return(0)
                        }
                    }
                }
            }
        )))
    }

    debug_print <- function(gwas_ids, modifiers, weights, zscores, stat, denom, output) {
        snpdf <- data.frame(
            CHROM = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 1),
            POS = as.double(sapply(strsplit(gwas_ids, '[:|_]'), '[[', 2)),
            VARIANT = gsub(":", "_", gwas_ids),
            REF = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 3),
            ALT = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 4),
            MODIFIERS = modifiers,
            WEIGHTS = weights,
            ZSCORES = zscores) %>% arrange(POS)
        cat("\n\nSNP info:\n\n", file = output, sep = "\n", append = TRUE)
        write_delim(
            snpdf,
            output,
            append = TRUE,
            col_names = TRUE,
            quote = "none")
        cat("\n\nmodifiers:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$MODIFIERS, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat("\n\nweights:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$WEIGHTS, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat("\n\nz-stats:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$ZSCORES, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat(paste0("\n\ntest score = ", stat, "\n\n"), file = output, sep = "\n", append = TRUE)
        cat(paste0("\n\ntest variance = ", denom, "\n\n"), file = output, sep = "\n", append = TRUE)
    }

    burden <- function(weight_ids, gwas_ids, uber_ids, genotypeMatrix, ld_uber_ids, weights, zscores) {
        modifiers <- handle_weights(gwas_ids, weight_ids)
        weights <- modifiers * weights

        # Get Variants in Ref Panel, GWAS, and eQTL
        common_variants <- intersect(
            ld_uber_ids,
            uber_ids)

        reducedGenotypeMatrix <- genotypeMatrix$genotypes[, which(ld_uber_ids %in% common_variants)]
        ld_matrix <- ld(reducedGenotypeMatrix, reducedGenotypeMatrix, stats="R")
        
        stat <- t(weights) %*% zscores
        denom <- t(weights) %*% ld_matrix %*% weights
        zscore <- stat/sqrt(denom)

        if (length(zscores) == 1) {
            zscore <- zscores[[1]]
            if (weights[[1]] < 0) {
                zscore <- zscore * -1
            }
        }
        
        pval <- pchisq( zscore * zscore, 1, lower.tail = FALSE)
        
        debug_print(gwas_ids, modifiers, weights, zscores, stat, denom, file.path("${_output:d}", "ptwas-scan.debug"))

        return(pval)
    }
            
    generate_index <- function(variant) {
        return(
            unlist(lapply(
                strsplit(variant, '[:|_]'),
                function(x) {
                    alleles <- list(c(x[3], x[4]), c(aflip(x[3]), aflip(x[4])), c(x[4], x[3]), c(aflip(x[4]), aflip(x[3])))
                    alleles <- alleles[order(sapply(alleles, '[[', 1))]
                    return(
                        paste(
                            c(
                                x[1],
                                x[2],
                                paste(sapply(alleles, '[[', 1), collapse = "|"),
                                paste(sapply(alleles, '[[', 2), collapse = "|")
                            ),
                            collapse = ":"
                        ))
                })))
    }
            
    aflip <- function(allele) {
        if( allele == "A" ) {
            return("T")
        }
        else if( allele == "C" ) {
            return("G")
        }
        else if( allele == "T" ) {
            return("A")
        }
        else if( allele == "G" ) {
            return("C")
        }
        else {
            return("")
        }
    }

    ptwas_weights <- read_delim(
        "${ptwas_weights}",
        delim = "\t",
        show_col_types = FALSE) %>% 
        mutate(
            uber_id = generate_index(variant))
    gwas <- read_delim(
        "${gwas_path}", delim = "\t", comment = "##", show_col_types = FALSE) %>%
        dplyr::rename(Z = ZSCORE) %>%
        mutate(
            `#CHR` = as.character(`#CHR`),             
            `#CHR` = ifelse(!startsWith(`#CHR`, "chr"), paste0("chr", `#CHR`), `#CHR`),
            SNP_ID = gsub("_", ":", SNP_ID),
            uber_id = generate_index(SNP_ID))

    regionlist <- read_delim(
        "${region_list}",
        delim = "\t",
        show_col_types = FALSE)
    colnames(regionlist) <- c("#CHR", "start", "end", "gene_id", "gene_name")

    input_dataframe <- merge(
        regionlist,
        merge(ptwas_weights, gwas, by = c("uber_id"), all = FALSE),
        by = c("#CHR", "gene_id"), all = FALSE)

    write.table(input_dataframe, "${_output}", col.names = T, row.names = F, quote = F, sep='\t')    
    



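The merge in `ptwas_scan_1` joins weights and GWAS records on `uber_id`, a key built by `generate_index` so that the same variant matches regardless of allele order or strand. Below is a minimal Python sketch of that idea; `flip` and `uber_id` are hypothetical names mirroring `aflip` and `generate_index`, not part of the pipeline.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(allele):
    """Complement a single-base allele; anything else maps to '' as in aflip()."""
    return COMPLEMENT.get(allele, "")


def uber_id(variant_id):
    """Build a key that is identical for swapped and strand-flipped allele pairs."""
    chrom, pos, a1, a2 = re.split(r"[:_]", variant_id)
    # The four equivalent representations: as given, strand-flipped,
    # swapped, and swapped plus strand-flipped.
    pairs = [(a1, a2), (flip(a1), flip(a2)), (a2, a1), (flip(a2), flip(a1))]
    pairs.sort(key=lambda pair: pair[0])  # stable sort on the first allele
    firsts = "|".join(pair[0] for pair in pairs)
    seconds = "|".join(pair[1] for pair in pairs)
    return ":".join([chrom, pos, firsts, seconds])


print(uber_id("chr1:24080045_G_A"))
print(uber_id("chr1:24080045:A:G") == uber_id("chr1:24080045_G_A"))
```

Because the key discards orientation, the sign of each weight still has to be aligned separately, which is what `handle_weights` does inside `burden`.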
In [None]:
[ptwas_scan_2]
parameter: vcf_path_prefix="/mnt/vast/hpc/csg/rf2872/data/1000Genome/LD_panel_Travyse_NoQualityGuarantee/EUR."
parameter: vcf_path_suffix=".shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"
parameter: region_list=""
parameter: ldrefbuild="hg38"
import pandas as pd
# Extract unique values, remove 'chr', convert to int, sort, and then add 'chr' back
input_df = file_target(f'{cwd}/cache/input_dataframe.txt')
input_dataframe = pd.read_csv(f'{cwd}/cache/input_dataframe.txt', sep='\t')
chroms = ["chr" + str(x) for x in sorted([int(chrom.replace("chr", "")) for chrom in input_dataframe["#CHR"].unique()])]
chroms_LD = [f"{vcf_path_prefix}{chrom}{vcf_path_suffix}" for chrom in chroms]

input: chroms_LD, group_by = 1
output: f'{cwd}/{tissue}.{_input:bn}.ptwas.output'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    library(tidyverse)
    library(harmonicmeanp)
    library(VariantAnnotation, verbose = FALSE)
    library(snpStats, verbose = FALSE)
    handle_weights <- function(gwas_ids, weight_ids) {
        return(unlist(lapply(
            strsplit(gwas_ids, '[:|_]'),
            function(x) {
                POSs  <- as.double(sapply(strsplit(weight_ids, '[:|_]'), '[[', 2))
                REFs <- sapply(strsplit(weight_ids, '[:|_]'), '[[', 3)
                ALTs <- sapply(strsplit(weight_ids, '[:|_]'), '[[', 4)
                for (i in 1:length(REFs)) {
                    if (as.double(x[2]) == POSs[i]) {
                        if (x[3] == REFs[i] & x[4] == ALTs[i]) {
                            return(1)
                        } else if (x[3] == ALTs[i] & x[4] == REFs[i]) {
                            return(-1)
                        } else if (x[3] == aflip(REFs[i]) & x[4] == aflip(ALTs[i])) {
                            return(1)
                        } else if (x[3] == aflip(ALTs[i]) & x[4] == aflip(REFs[i])) {
                            return(-1)
                        } else {
                            return(0)
                        }
                    }
                }
            }
        )))
    }

    debug_print <- function(gwas_ids, modifiers, weights, zscores, stat, denom, output) {
        snpdf <- data.frame(
            CHROM = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 1),
            POS = as.double(sapply(strsplit(gwas_ids, '[:|_]'), '[[', 2)),
            VARIANT = gsub(":", "_", gwas_ids),
            REF = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 3),
            ALT = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 4),
            MODIFIERS = modifiers,
            WEIGHTS = weights,
            ZSCORES = zscores) %>% arrange(POS)
        cat("\n\nSNP info:\n\n", file = output, sep = "\n", append = TRUE)
        write_delim(
            snpdf,
            output,
            append = TRUE,
            col_names = TRUE,
            quote = "none")
        cat("\n\nmodifiers:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$MODIFIERS, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat("\n\nweights:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$WEIGHTS, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat("\n\nz-stats:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$ZSCORES, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat(paste0("\n\ntest score = ", stat, "\n\n"), file = output, sep = "\n", append = TRUE)
        cat(paste0("\n\ntest variance = ", denom, "\n\n"), file = output, sep = "\n", append = TRUE)
    }

    burden <- function(weight_ids, gwas_ids, uber_ids, genotypeMatrix, ld_uber_ids, weights, zscores) {
        modifiers <- handle_weights(gwas_ids, weight_ids)
        weights <- modifiers * weights

        # Get Variants in Ref Panel, GWAS, and eQTL
        common_variants <- intersect(
            ld_uber_ids,
            uber_ids)

        reducedGenotypeMatrix <- genotypeMatrix$genotypes[, which(ld_uber_ids %in% common_variants)]
        ld_matrix <- ld(reducedGenotypeMatrix, reducedGenotypeMatrix, stats="R")
        
        stat <- t(weights) %*% zscores
        denom <- t(weights) %*% ld_matrix %*% weights
        zscore <- stat/sqrt(denom)

        if (length(zscores) == 1) {
            zscore <- zscores[[1]]
            if (weights[[1]] < 0) {
                zscore <- zscore * -1
            }
        }
        
        pval <- pchisq( zscore * zscore, 1, lower.tail = FALSE)
        
        debug_print(gwas_ids, modifiers, weights, zscores, stat, denom, file.path("${_output:d}", "ptwas-scan.debug"))

        # pval is a 1 x 1 matrix; return it as a scalar
        return(pval[1])
    }
            
    generate_index <- function(variant) {
        return(
            unlist(lapply(
                strsplit(variant, '[:|_]'),
                function(x) {
                    alleles <- list(c(x[3], x[4]), c(aflip(x[3]), aflip(x[4])), c(x[4], x[3]), c(aflip(x[4]), aflip(x[3])))
                    alleles <- alleles[order(sapply(alleles, '[[', 1))]
                    return(
                        paste(
                            c(
                                x[1],
                                x[2],
                                paste(sapply(alleles, '[[', 1), collapse = "|"),
                                paste(sapply(alleles, '[[', 2), collapse = "|")
                            ),
                            collapse = ":"
                        ))
                })))
    }
            
    aflip <- function(allele) {
        if( allele == "A" ) {
            return("T")
        }
        else if( allele == "C" ) {
            return("G")
        }
        else if( allele == "T" ) {
            return("A")
        }
        else if( allele == "G" ) {
            return("C")
        }
        else {
            return("")
        }
    }
    LDvcf <- readVcf(
        "${_input}",  "${ldrefbuild}")

    # Create Genotype SNPMatrix
    genotypeMatrix <- genotypeToSnpMatrix(LDvcf)

    # Get Variants Present in Ref Panel
    ld_variants <- generate_index(paste0("chr", colnames(genotypeMatrix$genotypes)))

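    # Load the merged weights/GWAS table written by ptwas_scan_1
    input_dataframe <- read_delim(
        "${cwd}/cache/input_dataframe.txt",
        delim = "\t",
        show_col_types = FALSE)

    results <- input_dataframe %>%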
        filter(uber_id %in% ld_variants) %>%
        mutate(weight = as.double(weight)) %>%
        group_by(gene_id, tissue) %>%
        mutate(
            nsnps = length(variant),
            burden_pval = burden(variant, SNP_ID, uber_id, genotypeMatrix, ld_variants, weight, Z)) %>%
        ungroup()
    
    # Only burden_pval is computed in this step, so only that p-value is written
    write_delim(
        results %>%
            subset(select=c("gene_id", "tissue", "nsnps", "burden_pval")) %>%
            distinct(gene_id, tissue, .keep_all = TRUE),
        "${_output}",
        delim = "\t",
        append = FALSE,
        quote = "none")
    

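Before the weighted sum is formed, `handle_weights` assigns each variant a modifier of `1`, `-1` or `0` depending on how the GWAS alleles relate to the alleles of the weight. The sketch below reproduces that decision for one pair of variants already matched on position; `allele_modifier` is a hypothetical name used only here.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(allele):
    """Complement a single-base allele; anything else maps to ''."""
    return COMPLEMENT.get(allele, "")


def allele_modifier(gwas_alleles, weight_alleles):
    """Return 1 (same orientation), -1 (swapped) or 0 (alleles do not match)."""
    g_ref, g_alt = gwas_alleles
    w_ref, w_alt = weight_alleles
    if (g_ref, g_alt) == (w_ref, w_alt):
        return 1
    if (g_ref, g_alt) == (w_alt, w_ref):
        return -1
    if (g_ref, g_alt) == (flip(w_ref), flip(w_alt)):
        return 1   # same orientation on the opposite strand
    if (g_ref, g_alt) == (flip(w_alt), flip(w_ref)):
        return -1  # swapped on the opposite strand
    return 0


for gwas in [("G", "A"), ("A", "G"), ("C", "T"), ("T", "C"), ("G", "T")]:
    print(gwas, allele_modifier(gwas, ("G", "A")))
```

One caveat worth keeping in mind: for palindromic variants (A/T or C/G) a swap and a strand flip are indistinguishable, and with this ordering of checks they are treated as a swap.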
In [None]:
[ptwas_scan_withoutLD]
parameter: ptwas_weights=""
parameter: gwas_path=""
parameter: region_list=""
output: f'{cwd}/{tissue}.ptwas.output'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    library(tidyverse)
    library(harmonicmeanp)
    handle_weights <- function(gwas_ids, weight_ids) {
        return(unlist(lapply(
            strsplit(gwas_ids, '[:|_]'),
            function(x) {
                POSs  <- as.double(sapply(strsplit(weight_ids, '[:|_]'), '[[', 2))
                REFs <- sapply(strsplit(weight_ids, '[:|_]'), '[[', 3)
                ALTs <- sapply(strsplit(weight_ids, '[:|_]'), '[[', 4)
                for (i in 1:length(REFs)) {
                    if (as.double(x[2]) == POSs[i]) {
                        if (x[3] == REFs[i] & x[4] == ALTs[i]) {
                            return(1)
                        } else if (x[3] == ALTs[i] & x[4] == REFs[i]) {
                            return(-1)
                        } else if (x[3] == aflip(REFs[i]) & x[4] == aflip(ALTs[i])) {
                            return(1)
                        } else if (x[3] == aflip(ALTs[i]) & x[4] == aflip(REFs[i])) {
                            return(-1)
                        } else {
                            return(0)
                        }
                    }
                }
            }
        )))
    }

    debug_print <- function(gwas_ids, modifiers, weights, zscores, stat, denom, output) {
        snpdf <- data.frame(
            CHROM = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 1),
            POS = as.double(sapply(strsplit(gwas_ids, '[:|_]'), '[[', 2)),
            VARIANT = gsub(":", "_", gwas_ids),
            REF = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 3),
            ALT = sapply(strsplit(gwas_ids, '[:|_]'), '[[', 4),
            MODIFIERS = modifiers,
            WEIGHTS = weights,
            ZSCORES = zscores) %>% arrange(POS)
        cat("\n\nSNP info:\n\n", file = output, sep = "\n", append = TRUE)
        write_delim(
            snpdf,
            output,
            append = TRUE,
            col_names = TRUE,
            quote = "none")
        cat("\n\nmodifiers:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$MODIFIERS, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat("\n\nweights:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$WEIGHTS, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat("\n\nz-stats:\n\n", file = output, sep = "\n", append = TRUE)
        cat(paste(snpdf$ZSCORES, collapse = " "), file = output, sep = "\n", append = TRUE)
        cat(paste0("\n\ntest score = ", stat, "\n\n"), file = output, sep = "\n", append = TRUE)
        cat(paste0("\n\ntest variance = ", denom, "\n\n"), file = output, sep = "\n", append = TRUE)
    }

    burden <- function(weight_ids, gwas_ids, weights, zscores) {
        modifiers <- handle_weights(gwas_ids, weight_ids)
        weights <- modifiers * weights
        
        stat <- t(weights) %*% zscores
        denom <- t(weights) %*% weights
        zscore <- stat/sqrt(denom)

        if (length(zscores) == 1) {
            zscore <- zscores[[1]]
            if (weights[[1]] < 0) {
                zscore <- zscore * -1
            }
        }
        
        pval <- pchisq( zscore * zscore, 1, lower.tail = FALSE)
        
        debug_print(gwas_ids, modifiers, weights, zscores, stat, denom, file.path("${_output:d}", "ptwas-scan.debug"))

        return(pval[1])
    }
        
            
    generate_index <- function(variant) {
        return(
            unlist(lapply(
                strsplit(variant, '[:|_]'),
                function(x) {
                    alleles <- list(c(x[3], x[4]), c(aflip(x[3]), aflip(x[4])), c(x[4], x[3]), c(aflip(x[4]), aflip(x[3])))
                    alleles <- alleles[order(sapply(alleles, '[[', 1))]
                    return(
                        paste(
                            c(
                                x[1],
                                x[2],
                                paste(sapply(alleles, '[[', 1), collapse = "|"),
                                paste(sapply(alleles, '[[', 2), collapse = "|")
                            ),
                            collapse = ":"
                        ))
                })))
    }
            
    aflip <- function(allele) {
        if( allele == "A" ) {
            return("T")
        }
        else if( allele == "C" ) {
            return("G")
        }
        else if( allele == "T" ) {
            return("A")
        }
        else if( allele == "G" ) {
            return("C")
        }
        else {
            return("")
        }
    }


    ptwas_weights <- read_delim(
        "${ptwas_weights}",
        delim = "\t",
        show_col_types = FALSE) %>% 
        mutate(
            uber_id = generate_index(variant))
    gwas <- read_delim(
        "${gwas_path}", delim = "\t", comment = "##", show_col_types = FALSE) %>%
        rename(Z = ZSCORE) %>%
        mutate(
            `#CHR` = as.character(`#CHR`),             
            `#CHR` = ifelse(!startsWith(`#CHR`, "chr"), paste0("chr", `#CHR`), `#CHR`),
            SNP_ID = gsub("_", ":", SNP_ID),
            uber_id = generate_index(SNP_ID))

    regionlist <- read_delim(
        "${region_list}",
        delim = "\t",
        show_col_types = FALSE)
    colnames(regionlist) <- c("#CHR", "start", "end", "gene_id", "gene_name")

    input_dataframe <- merge(
        regionlist,
        merge(ptwas_weights, gwas, by = c("uber_id"), all = FALSE),
        by = c("#CHR", "gene_id"), all = FALSE)

    results <- input_dataframe %>%
        mutate(weight = as.double(weight)) %>%
        group_by(gene_id, tissue) %>%
        mutate(
            nsnps = length(variant),
            burden_pval = burden(variant, SNP_ID, weight, Z)) %>%
        ungroup()

    
    write_delim(
        results %>%
            subset(select=c("gene_id", "tissue", "nsnps", "burden_pval")) %>%
            distinct(gene_id, tissue, .keep_all = TRUE),
        "${_output}",
        delim = "\t",
        append = FALSE,
        quote = "none")
        



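The only difference between the two `burden` variants is the variance term: `ptwas_scan` uses `t(weights) %*% ld_matrix %*% weights`, while `ptwas_scan_withoutLD` uses `t(weights) %*% weights`. The sketch below compares the two on a toy gene with two variants in strong positive LD. All numbers are invented, and the p-value uses the identity that the upper tail of a 1-df chi-square at `z^2` equals `erfc(|z| / sqrt(2))`.

```python
import math


def burden_z(weights, zscores, ld=None):
    """Weighted burden z-score; ld=None uses the identity matrix (no LD)."""
    n = len(weights)
    if ld is None:
        ld = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    stat = sum(w * z for w, z in zip(weights, zscores))
    denom = sum(weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n))
    return stat / math.sqrt(denom)


def chisq1_pvalue(z):
    """Upper-tail p-value of z^2 under a chi-square with 1 degree of freedom."""
    return math.erfc(abs(z) / math.sqrt(2.0))


weights = [0.5, 0.5]
zscores = [2.0, 2.0]
ld = [[1.0, 0.9],
      [0.9, 1.0]]

z_no_ld = burden_z(weights, zscores)
z_ld = burden_z(weights, zscores, ld)
p_no_ld = chisq1_pvalue(z_no_ld)
p_ld = chisq1_pvalue(z_ld)
print(f"without LD: z = {z_no_ld:.3f}, p = {p_no_ld:.3g}")
print(f"with LD:    z = {z_ld:.3f}, p = {p_ld:.3g}")
```

In this toy case, ignoring LD gives a larger z-score and a smaller p-value, so results from the LD-free scan are best read with that in mind when a gene has many correlated variants.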
**output_3**: PTWAS results


In [8]:
head output/DLPFC.ptwas.output


gene_id	tissue	nsnps	burden_pval
ENSG00000000457	DLPFC	3023	0.37256142427110756
ENSG00000000971	DLPFC	7465	0.35697314943086955
ENSG00000001460	DLPFC	1999	7.006837334425248e-71
ENSG00000001461	DLPFC	1999	9.897049424129262e-51
ENSG00000002016	DLPFC	2710	7.302886702775239e-20
ENSG00000003056	DLPFC	6052	3.7933865595690386e-23
ENSG00000003249	DLPFC	2747	4.742781958897929e-131
ENSG00000002834	DLPFC	4430	1.1702859075378423e-74
ENSG00000003137	DLPFC	4655	1.8083016266006384e-55
