<style>
p {
  line-height: 1.5;
}
</style>

# PTWAS Estimation and Model Validation Implementation in R

This notebook implements the procedures for PTWAS estimation and model validation in R. The unit of analysis is a single gene-trait pair, whose information is summarized in a text file. First, we generate the units of analysis. Then, we run the causal effect estimation and validation procedure.

## Overview

* Organize the eQTL fine-mapping results, eQTL summary statistics, and GWAS summary statistics data into a format that can be processed by ptwas_est.  

* Run ptwas_est in R: causal effect estimation and validation.

### Input

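* eQTL fine-mapping result: contains the credible sets and the posterior inclusion probability (PIP) of each SNP.

* eQTL summary statistics: effect size estimate (beta) and its standard error (se)

* GWAS summary statistics: effect size estimate (beta) and its standard error (se)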

### Output

* data format for ptwas_est

* ptwas_est results

## Example

### Generate the units of analysis

####  Input
* **eQTL finemapping**: /mnt/vast/hpc/csg/rf2872/Work/INTACT/fastenloc_test/eqtl_susie_table.txt

* **eQTL sumstats**: /mnt/vast/hpc/csg/ftp_lisanwanglab_sync/ftp_fgc_xqtl/projects/rna-seq/BU/ROSMAP_DLPFC/eQTL/association_scan/norminal_qced_files

* **GWAS sumstats**: /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022

#### Output file format

Each row of the file corresponds to a single SNP that is inferred to be a member of a signal cluster. The columns are:

* snp: SNP name

* beta_eQTL: eQTL effect

* se_eQTL: standard error of estimated eQTL effect

* beta_GWAS: GWAS effect

* se_GWAS: standard error of GWAS effect

* cluster: Signal cluster ID (credible sets index)

* pip: SNP posterior inclusion probability (PIP)

* gene_id: gene name
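
As a quick sanity check before estimation, the sketch below builds a small table in this format and verifies that the expected columns are present and plausible. The two rows are copied from the example input shown in this notebook; the helper name `check_ptwas_input` is hypothetical and not part of any package.

```r
# Columns expected by the estimation step, in the order documented above
required_cols = c("snp", "beta_eQTL", "se_eQTL", "beta_GWAS", "se_GWAS",
                  "cluster", "pip", "gene_id")

# Hypothetical helper: stop with an error if the table is malformed
check_ptwas_input = function(df) {
    df = as.data.frame(df)  # also accepts a data.table or tibble
    missing_cols = setdiff(required_cols, names(df))
    if (length(missing_cols) > 0) {
        stop("missing columns: ", paste(missing_cols, collapse = ", "))
    }
    numeric_cols = c("beta_eQTL", "se_eQTL", "beta_GWAS", "se_GWAS", "pip")
    stopifnot(all(sapply(df[numeric_cols], is.numeric)))
    stopifnot(all(df$se_eQTL > 0), all(df$se_GWAS > 0))
    stopifnot(all(df$pip >= 0 & df$pip <= 1))
    invisible(TRUE)
}

# Two rows taken from the example input of this notebook
example_input = data.frame(
    snp = c("chr1:169880000_A_G", "chr1:169882224_A_G"),
    beta_eQTL = c(-0.2883383, -0.2859217),
    se_eQTL = c(0.03356421, 0.03388142),
    beta_GWAS = c(0.0031, 0.0024),
    se_GWAS = c(0.0137, 0.0138),
    cluster = c(1L, 1L),
    pip = c(0.020186373, 0.003048243),
    gene_id = "ENSG00000000457"
)
check_ptwas_input(example_input)
```

The check is deliberately strict about standard errors and PIPs, since a zero standard error or a PIP outside [0, 1] would silently distort the weights used later.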

In [3]:
# Organize the eQTL fine-mapping results, eQTL summary statistics, and GWAS summary statistics
library(data.table)
library(tidyverse)
library(stringr)

# Table of fine-mapped genes and the folder holding the per-gene SuSiE fits
eqtl_table_list = fread("/mnt/vast/hpc/csg/rf2872/Work/INTACT/fastenloc_test/eqtl_susie_table.txt")
finemapping_folder = "/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/cache"

# Folder holding the eQTL summary statistics
eqtl_folder = "/mnt/vast/hpc/csg/ftp_lisanwanglab_sync/ftp_fgc_xqtl/projects/rna-seq/BU/ROSMAP_DLPFC/eQTL/association_scan/norminal_qced_files"
# Folder holding the GWAS summary statistics
GWAS_folder = "/mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022"

gene_name_list = eqtl_table_list$molecular_trait_i
eqtl = data.frame()
# Process the first 100 genes (assumes the table lists at least 100 genes)
for (k in 1:100){
    gene_name = gene_name_list[k]
    eqtl_finemapp = readRDS(paste0(finemapping_folder,"/demo.",gene_name,".unisusie.fit.rds"))
    cluster_num = eqtl_finemapp$dlpfc_eqtl$sets$cs_index
    # Skip genes without any credible set
    if (!is.null(cluster_num)){
        # Chromosome number, taken from variant names of the form chr1:169880000_A_G
        chr = str_sub(unique(str_split(eqtl_finemapp$dlpfc_eqtl$variable_name,":",simplify=T)[,1]),4)
        eqtl_data = fread(file = paste0(eqtl_folder,"/dlpfc_batch_all.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.processed_phenotype.per_chrom_dlpfc_batch_all.rnaseqc.ROSMAP_covariates.ROSMAP_NIA_WGS.pca.PEER.txt.",chr,".norminal.cis_long_table.txt"))
        GWAS_data = fread(paste0(GWAS_folder,"/ADGWAS_Bellenguez_2022.",chr,"/ADGWAS2022.chr",chr,".sumstat.tsv"))
        for (i in seq_along(cluster_num)){
            cluster_id = cluster_num[i]
            cluster_id_name = paste0("L",cluster_id)
            # Indices and names of the SNPs in this credible set
            crediblesets = eqtl_finemapp$dlpfc_eqtl$sets$cs[[cluster_id_name]]
            variants_name = eqtl_finemapp$dlpfc_eqtl$variable_name[crediblesets]
            # PIPs of the credible-set members, kept numeric
            pip = as.numeric(eqtl_finemapp$dlpfc_eqtl$pip[crediblesets])
            eqtl_cluster = tibble(snp = variants_name, cluster = rep(cluster_id, length(variants_name)), pip = pip, gene_id = rep(gene_name, length(variants_name)))
            # eQTL effects and standard errors of this gene for the credible-set SNPs
            eqtl_beta_se = eqtl_data %>% filter(molecular_trait_id %in% gene_name) %>% filter(variant %in% variants_name) %>% select(variant, beta, se) %>% rename("snp" = "variant", "beta_eQTL" = "beta", "se_eQTL" = "se")
            # GWAS effects and standard errors; the first "_" becomes ":" to match the eQTL variant names
            GWAS_beta_se = GWAS_data %>% mutate(snp = sub("_", ":", variant)) %>% filter(snp %in% variants_name) %>% select(snp, beta, se) %>% rename("beta_GWAS" = "beta", "se_GWAS" = "se")
            # Keep only SNPs present in all three tables
            df = list(eqtl_beta_se, GWAS_beta_se, eqtl_cluster) %>% reduce(full_join, by = "snp") %>% drop_na()
            eqtl = rbind(eqtl, df)
        }
        rm(eqtl_data)
        rm(GWAS_data)
    }
}
# Write the combined table once, after all genes have been processed
fwrite(eqtl, file = "/home/aw3600/PTWAS/PTWAS_test_data.txt")

### Causal effect estimation and validation

#### Estimation
* First, we select strong eQTLs by thresholding on the signal-level PIP, i.e. the sum of the SNP-level PIPs within a signal cluster (threshold `cpip` = 0.5).

* Then, we aggregate the estimates from the individual eQTLs using a fixed-effect meta-analysis procedure.
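
The two steps above can be sketched in base R on a toy table. This is a minimal illustration, not the notebook's pipeline: all numbers are made up, the helper name `estimate_signal` is hypothetical, and the SNP-level ratio and its delta-method standard error follow the formulas used in the `ptwas_est` cell of this notebook.

```r
# Toy input: made-up values for one gene with three signal clusters
toy = data.frame(
    cluster   = c(1, 1, 2, 2, 3, 3),
    pip       = c(0.6, 0.3, 0.5, 0.2, 0.2, 0.1),
    beta_eQTL = c(0.50, 0.45, 0.30, 0.28, 0.20, 0.22),
    se_eQTL   = c(0.05, 0.05, 0.06, 0.06, 0.05, 0.05),
    beta_GWAS = c(0.020, 0.018, 0.010, 0.012, 0.030, 0.025),
    se_GWAS   = rep(0.010, 6)
)
cpip = 0.5  # threshold on the signal-level PIP

# Hypothetical helper: PIP-weighted signal-level estimate for one cluster
estimate_signal = function(d) {
    z = d$beta_eQTL / d$se_eQTL       # standardized eQTL effect (standard error 1)
    beta_yx = d$beta_GWAS / z         # SNP-level ratio estimate
    se_yx = sqrt(d$se_GWAS^2 / z^2 + d$beta_GWAS^2 / z^4)  # delta-method standard error
    w = d$pip / sum(d$pip)            # PIP weights within the cluster
    grp_beta = sum(w * beta_yx)
    grp_se = sqrt(sum(w * (beta_yx^2 + se_yx^2)) - grp_beta^2)
    data.frame(cluster = d$cluster[1], spip = sum(d$pip),
               grp_beta = grp_beta, grp_se = grp_se)
}

# Step 1: signal-level estimates, keeping only strong signals
signals = do.call(rbind, lapply(split(toy, toy$cluster), estimate_signal))
strong = signals[signals$spip >= cpip, ]

# Step 2: fixed-effect (inverse-variance weighted) meta-analysis across signals
wv = 1 / strong$grp_se^2
meta = sum(wv * strong$grp_beta) / sum(wv)
se_meta = sqrt(1 / sum(wv))
```

In this toy table, clusters 1 and 2 pass the threshold (signal-level PIPs of 0.9 and 0.7), while cluster 3 (0.3) is dropped before the meta-analysis.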

#### Validation
Identify severe violations of the exclusion restriction (ER) assumption and exclude them from the downstream estimation procedure.
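
A minimal sketch of this check, assuming the signal-level estimates and their standard errors are already available. The helper name `heterogeneity`, the toy values, and the cutoff `i2_cutoff` are illustrative assumptions; this notebook does not prescribe a specific cutoff.

```r
# Hypothetical helper: Cochran's Q and I2 from signal-level estimates
heterogeneity = function(grp_beta, grp_se) {
    wv = 1 / grp_se^2                        # inverse-variance weights
    meta = sum(wv * grp_beta) / sum(wv)      # fixed-effect estimate
    Q = sum(wv * (grp_beta - meta)^2)        # Cochran's Q
    k = length(grp_beta)                     # number of signals
    I2 = if (Q > 1e-3) max(0, (Q - (k - 1)) / Q) else 0
    list(meta = meta, Q = Q, I2 = I2)
}

# Two made-up genes, each with two signals
consistent = heterogeneity(c(0.10, 0.12), c(0.05, 0.05))
discordant = heterogeneity(c(0.10, 0.50), c(0.05, 0.05))

# Flag genes whose I2 exceeds an illustrative cutoff
i2_cutoff = 0.5
flagged = c(consistent = consistent$I2 > i2_cutoff,
            discordant = discordant$I2 > i2_cutoff)
```

For the first toy gene, Q is about 0.08 and $I^2$ is 0; for the second, Q is about 32 and $I^2$ is about 0.97, so only the second gene is flagged. With a single signal, Q is 0 and the check carries no information.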

#### Input file format
The output file of the first step (see "Generate the units of analysis").


#### Output file format
Each row of the file corresponds to a gene and the columns are:

* gene_id: gene name

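* num_cluster: the number of credible sets of the gene that pass the PIP threshold

* num_instruments: the number of SNPs (instruments) in the retained credible sets of the gene

* spip: the sum of the PIPs of the SNPs in a credible set; when a gene has several retained credible sets, the value of the first one is reported

* grp_beta: signal-level estimate (combines the SNP-level estimates of the member SNPs, weighted by their PIPs); reported for the same credible set as spip

* grp_se: standard error of the signal-level estimate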

* meta: gene-level estimate of the causal effect (aggregate the signal-level estimates by applying a fixed-effect meta-analysis model)

* se_meta: standard error of the gene-level estimate of the causal effect

* Q: Cochran’s Q statistic

* I2: $I^2$ statistic, which quantifies heterogeneity and ranges from 0 to 1. $I^2 \rightarrow 0$ indicates reasonable consistency among the estimates from multiple eQTLs, whereas $I^2 \rightarrow 1$ implies severe departures from the exclusion restriction (ER) assumption.
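
For reference, the columns above correspond to the following quantities, as computed in the `ptwas_est` cell of this notebook. The symbols are notation introduced here for illustration: $b_i$ and $s_i$ denote an effect estimate and its standard error for SNP $i$, $i \in c$ runs over the member SNPs of credible set $c$, and $K$ is the number of retained credible sets of the gene.

$$
\begin{aligned}
z_i &= \frac{b_i^{\mathrm{eQTL}}}{s_i^{\mathrm{eQTL}}}, \qquad
\hat\beta_i = \frac{b_i^{\mathrm{GWAS}}}{z_i}, \qquad
\hat s_i^{\,2} = \frac{(s_i^{\mathrm{GWAS}})^2}{z_i^2} + \frac{(b_i^{\mathrm{GWAS}})^2}{z_i^4}, \\
\mathrm{spip}_c &= \sum_{i \in c} \mathrm{pip}_i, \qquad
\hat\beta_c = \sum_{i \in c} \frac{\mathrm{pip}_i}{\mathrm{spip}_c}\, \hat\beta_i, \qquad
\hat s_c^{\,2} = \sum_{i \in c} \frac{\mathrm{pip}_i}{\mathrm{spip}_c} \left(\hat\beta_i^2 + \hat s_i^{\,2}\right) - \hat\beta_c^2, \\
w_c &= \frac{1}{\hat s_c^{\,2}}, \qquad
\hat\beta_{\mathrm{meta}} = \frac{\sum_{c=1}^{K} w_c \hat\beta_c}{\sum_{c=1}^{K} w_c}, \qquad
\mathrm{se}_{\mathrm{meta}} = \left(\sum_{c=1}^{K} w_c\right)^{-1/2}, \\
Q &= \sum_{c=1}^{K} w_c \left(\hat\beta_c - \hat\beta_{\mathrm{meta}}\right)^2, \qquad
I^2 = \max\!\left(0,\ \frac{Q - (K - 1)}{Q}\right).
\end{aligned}
$$

Here $\hat\beta_c$ and $\hat s_c$ are the grp_beta and grp_se columns, and $\hat\beta_{\mathrm{meta}}$ is the meta column. The code sets $I^2 = 0$ when $Q \le 10^{-3}$, which avoids dividing by a near-zero $Q$.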

#### Input

In [3]:
formatted_input = fread("/home/aw3600/PTWAS/PTWAS_test_data.txt")
head(formatted_input)

snp,beta_eQTL,se_eQTL,beta_GWAS,se_GWAS,cluster,pip,gene_id
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<chr>
chr1:169880000_A_G,-0.2883383,0.03356421,0.0031,0.0137,1,0.020186373,ENSG00000000457
chr1:169880823_G_A,-0.2883383,0.03356421,0.0034,0.0137,1,0.020186373,ENSG00000000457
chr1:169880877_T_C,-0.2883383,0.03356421,0.0033,0.0137,1,0.020186373,ENSG00000000457
chr1:169882224_A_G,-0.2859217,0.03388142,0.0024,0.0138,1,0.003048243,ENSG00000000457
chr1:169882228_G_A,-0.2859217,0.03388142,0.0023,0.0138,1,0.003048243,ENSG00000000457
chr1:169882581_G_T,-0.2883383,0.03356421,0.0025,0.0137,1,0.020186373,ENSG00000000457


#### ptwas_est

In [6]:
# Threshold on the cumulative PIP of a signal cluster
cpip = 0.5

# Heterogeneity: compute the I2 statistic from Cochran's Q statistic
# Q: Cochran's Q (identical on every row of a gene); Est: signal-level estimates of the gene
calc_I2 = function(Q, Est) {
    Q = Q[[1]]
    k = length(unique(Est))  # number of signal-level estimates
    I2 = if (Q > 1e-3) (Q - k + 1)/Q else 0
    return(if (I2 < 0) 0 else I2)
}

# ptwas_est
output = formatted_input %>%
    # Standardize the eQTL effects to z-scores, so that their standard error becomes 1
    mutate(
        beta_eQTL = beta_eQTL/se_eQTL,
        se_eQTL = 1) %>%
    group_by(gene_id, cluster) %>%
    mutate(spip = sum(pip)) %>%
    filter(spip >= cpip) %>% # keep clusters whose cumulative PIP reaches the user-defined threshold
    # Signal-level estimates: SNP-level ratio estimates combined with PIP weights
    mutate(
        beta_yx = beta_GWAS/beta_eQTL,
        se_yx = sqrt(
            (se_GWAS^2/beta_eQTL^2) + ((beta_GWAS^2*se_eQTL^2)/beta_eQTL^4)),
        grp_beta = sum((beta_yx*pip)/spip),
        grp_se = sum((beta_yx^2 + se_yx^2)*pip/spip)) %>%
    mutate(
        grp_se = sqrt(grp_se - grp_beta^2),
        wv = grp_se^-2) %>%
    ungroup() %>%
    # Gene-level estimates: fixed-effect meta-analysis across clusters.
    # unique() collapses the per-SNP rows to one value per cluster; this assumes
    # that the clusters of a gene have distinct wv and grp_beta values.
    group_by(gene_id) %>%
    mutate(
        meta = sum(unique(wv) * unique(grp_beta)),
        sum_w = sum(unique(wv)),
        se_meta = sqrt(sum_w^-1),
        num_cluster = length(unique(cluster))) %>%
    mutate(
        num_instruments = n(),
        meta = meta/sum_w,
        Q = sum(unique(wv)*(unique(grp_beta) - unique(meta))^2),
        I2 = calc_I2(Q, grp_beta)) %>%
    ungroup() %>%
    # One row per gene; cluster-level columns come from the first retained cluster
    distinct(gene_id, .keep_all = TRUE) %>%
    mutate(
        spip = round(spip, 3),
        grp_beta = round(grp_beta, 3),
        meta = round(meta, 3),
        se_meta = round(se_meta, 3),
        Q = round(Q, 3),
        I2 = round(I2, 3)) %>%
    select(gene_id, num_cluster, num_instruments, spip, grp_beta, grp_se, meta, se_meta, Q, I2)

fwrite(output, file = "/home/aw3600/PTWAS/PTWAS_est_data.txt")

#### Output

In [7]:
output

gene_id,num_cluster,num_instruments,spip,grp_beta,grp_se,meta,se_meta,Q,I2
<chr>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000457,2,74,0.696,0.0,0.0016118763,-0.001,0.001,0.157,0.0
ENSG00000000971,1,22,0.739,-0.001,0.0010414938,-0.001,0.001,0.0,0.0
ENSG00000001084,1,10,0.881,0.0,0.0010095378,0.0,0.001,0.0,0.0
ENSG00000001167,1,16,0.737,0.009,0.0029981468,0.009,0.003,0.0,0.0
ENSG00000001460,1,16,0.745,-0.001,0.0006707913,-0.001,0.001,0.0,0.0
ENSG00000001461,1,47,0.679,-0.005,0.0028313411,-0.005,0.003,0.0,0.0
ENSG00000001561,1,9,0.586,0.001,0.0006746664,0.001,0.001,0.0,0.0
ENSG00000001626,1,28,0.801,0.001,0.0011200443,0.001,0.001,0.0,0.0
ENSG00000001629,2,48,0.678,0.0,0.0019744363,0.0,0.001,0.046,0.0
ENSG00000001630,1,67,0.807,0.001,0.0011870369,0.001,0.001,0.0,0.0
