# DENTIST R and C++ comparison


## Overview

DENTIST is implemented in C++ (`https://github.com/Yves-CHEN/DENTIST`). To incorporate it into the pecotmr package, an R version was developed.

Here we test whether the R version produces the same results as the C++ version.

## Important note

DENTIST is not easy to compile from source. The paths in its makefile are messy, and we would need to install all dependencies and rewrite the makefile, which is quite time-consuming.

However, DENTIST provides an executable that does the same thing. We cannot modify it, so we can only work with its inputs and outputs. DENTIST involves randomness: it splits the region into two parts, S1 and S2, and iterates this multiple times, so the numerical values may differ between runs. Our strategy is therefore to run DENTIST 10 times in each system, average the numerical results, and check whether the averages are similar enough.
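The averaging strategy can be sketched as follows. This is a minimal illustration on simulated numbers, not real DENTIST output; the helper names `average_runs` and `similar_enough` and the tolerance of 0.05 are assumptions made for the example.

```r
# Average a per-variant statistic over repeated runs.
# `runs` is a numeric matrix: one row per variant, one column per run.
average_runs <- function(runs) rowMeans(runs)

# Compare two averaged vectors. `tol` is an arbitrary illustrative threshold.
similar_enough <- function(a, b, tol = 0.05) {
  max_abs_diff <- max(abs(a - b))
  list(max_abs_diff = max_abs_diff,
       correlation = stats::cor(a, b),
       pass = max_abs_diff < tol)
}

set.seed(1)
truth <- rnorm(100)
# Simulated stand-ins for 10 runs of each implementation (not DENTIST output):
# the same underlying values plus small run-to-run noise.
runs_cpp <- sapply(1:10, function(i) truth + rnorm(100, sd = 0.01))
runs_r   <- sapply(1:10, function(i) truth + rnorm(100, sd = 0.01))

cmp <- similar_enough(average_runs(runs_cpp), average_runs(runs_r))
print(cmp)
```

Reporting both the maximum absolute difference and the correlation helps separate a few badly mismatched variants from a systematic shift.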

## Input

The original DENTIST program needs PLINK files as input, plus summary statistics in COJO format (with columns `SNP A1 A2 freq b se p N`). Here we use the first-round RSS_QC result as a reference and pick a region that still has outliers after allele QC. We use this region to verify that the results are similar.

+ PLINK file: `/home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/PLINK_input_MWE`

+ Sumstat: `/home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/Sumstat_MAF_filtered.txt`


Notes about inputs: DENTIST only allows regions with 2000+ variants, and the input should not include variants that have MAF = 0 in PLINK. So the PLINK file here is already filtered with a MAF threshold of 0.01.
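A quick pre-flight check of these requirements might look like the sketch below. The helper `check_dentist_input` is hypothetical, and using the `freq` column as a proxy for monomorphic variants is an assumption; the actual MAF filter in this notebook is applied by PLINK.

```r
# Columns expected in a COJO-format summary statistics file.
cojo_cols <- c("SNP", "A1", "A2", "freq", "b", "se", "p", "N")

# Hypothetical helper: returns a character vector of problems (empty if none).
check_dentist_input <- function(sumstat, min_variants = 2000) {
  problems <- character(0)
  missing_cols <- setdiff(cojo_cols, colnames(sumstat))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }
  if (nrow(sumstat) < min_variants) {
    problems <- c(problems,
                  sprintf("only %d variants, need at least %d",
                          nrow(sumstat), as.integer(min_variants)))
  }
  # Assumption: a frequency of 0 or 1 is used as a proxy for MAF = 0.
  if ("freq" %in% colnames(sumstat) &&
      any(sumstat$freq <= 0 | sumstat$freq >= 1, na.rm = TRUE)) {
    problems <- c(problems, "some variants have freq of 0 or 1 (monomorphic)")
  }
  problems
}

# Toy table that violates all three checks (no N column, 3 rows, one freq of 0).
toy <- data.frame(SNP = paste0("rs", 1:3), A1 = "A", A2 = "G",
                  freq = c(0.2, 0, 0.4), b = 0.1, se = 0.05, p = 0.05)
toy_problems <- check_dentist_input(toy)
print(toy_problems)
```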

## Details

### 1. MWE data preparation

DENTIST version 1.3.0.0 from GitHub (https://github.com/Yves-CHEN/DENTIST) requires PLINK data, so I prepared it here.

In [36]:
library(tidyverse)
library(data.table)  # provides fread()
library(susieR)
library(plink2R)
library(pecotmr)

# The original sumstat is from the Bellenguez GWAS data.
# I found a region that has some signal and outliers, so it is well suited for an MWE.
# Keep ~4000 variants because DENTIST requires a region of that size.

test_data_path = "/home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/Original_sumstat.tsv"
snp_data <- fread(test_data_path)

# Get the variant list to extract from the PLINK data
# (col.names = F avoids writing a stray header line into the ID list)
snp_data %>% mutate(variant_id = paste0("chr", chrom, ":", pos, "_", A2, "_", A1)) %>% pull(variant_id) %>% 
    write.table("/home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/MWE_snplist.txt", row.names = F, col.names = F, quote = F)

In [None]:
plink --bfile /mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/geno_by_chrom/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.1 \
    --extract /home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/MWE_snplist.txt \
    --make-bed --maf 0.01 --geno 0.01 --out ~/RSS_QC/pecotmr/data/RSS_QC_MWE/PLINK_input_MWE

In [37]:
genotype = read_plink("~/RSS_QC/pecotmr/data/RSS_QC_MWE/PLINK_input_MWE")

# Keep only columns without NA so that the LD matrix can be computed without error
keep_index = which(!is.na(colSums(genotype$bed)))

# Overwrite the variant list with the NA-free variants (no header line)
names(keep_index) %>% 
    write.table("/home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/MWE_snplist.txt", row.names = F, col.names = F, quote = F)

snp_data %>% mutate(variant_id = paste0("chr", chrom, ":", pos, "_", A2, "_", A1)) %>%
    filter(variant_id %in% names(keep_index)) %>%
    select(variant_id, A1, A2, maf, beta, se,  pvalue ,n_case) %>% rename(SNP = variant_id,
                                                                        freq = maf, b = beta,
                                                                        p = pvalue, N = n_case) %>%
    write_tsv("~/RSS_QC/pecotmr/data/RSS_QC_MWE/Sumstat_MAF_filtered.txt")

In [None]:
# Rerun plink to remove the variants containing NA
plink --bfile /mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/geno_by_chrom/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.1 \
    --extract /home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/MWE_snplist.txt \
    --make-bed --maf 0.01 --geno 0.01 --out ~/RSS_QC/pecotmr/data/RSS_QC_MWE/PLINK_input_MWE

In [39]:
# Calculate the LD matrix needed by the Rcpp interface (run once; kept commented out for reference)
#sumstat = read_delim("/home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/Sumstat_MAF_filtered.txt")
#geno = read_plink("/home/hs3393//RSS_QC/pecotmr/data/RSS_QC_MWE/PLINK_input_MWE")$bed
#LD = as.matrix(cor(geno))
#write.table(LD, "/home/hs3393//RSS_QC/pecotmr/data/RSS_QC_MWE/LD_MWE.tsv", sep = "\t", quote = FALSE, row.names = TRUE, col.names = TRUE)

#########FINISHED############
sumstat = read_delim("/home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/Sumstat_MAF_filtered.txt") %>% mutate(z = b / se)
LD = fread("/home/hs3393//RSS_QC/pecotmr/data/RSS_QC_MWE/LD_MWE.tsv")[,-1]
LD = as.matrix(LD)

Rows: 4123 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): SNP, A1, A2
dbl (5): freq, b, se, p, N

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
“Detected 4123 column names but the data has 4124 columns (i.e. invalid file). Added 1 extra default column name for the fir
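The `fread` warning about one extra column arises because `write.table` with `row.names = TRUE` and `col.names = TRUE` writes a header with one fewer field than the data rows. One way to avoid it is the base R `col.names = NA` idiom, sketched below on a toy genotype matrix. The variant IDs and the dosage values are made up for illustration.

```r
set.seed(42)
# Toy genotype dosage matrix (20 samples x 4 variants); IDs are hypothetical.
geno_toy <- matrix(rbinom(80, size = 2, prob = 0.3), nrow = 20,
                   dimnames = list(NULL, paste0("chr1:", 100 * 1:4, "_A_G")))
ld_toy <- cor(geno_toy)

ld_file <- tempfile(fileext = ".tsv")
# col.names = NA writes an empty header cell above the row-name column,
# so the header and the data rows have the same number of fields.
write.table(ld_toy, ld_file, sep = "\t", quote = FALSE, col.names = NA)

# Read it back, using the first column as row names and keeping IDs unchanged.
ld_back <- as.matrix(read.table(ld_file, header = TRUE, sep = "\t",
                                row.names = 1, check.names = FALSE))

print(dim(ld_back))
print(isTRUE(all.equal(ld_toy, ld_back)))
```

With this layout the matrix can be read back without dropping the first column by hand, as the `[,-1]` in the cell above does.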

### 2. DENTIST -- GitHub compiled version

In [None]:
# DENTIST -- GitHub compiled version

~/RSS_QC/DENTIST/DENTIST_1.3.0.0 \
    --bfile /home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/PLINK_input_MWE \
    --gwas-summary /home/hs3393/RSS_QC/pecotmr/data/RSS_QC_MWE/Sumstat_MAF_filtered.txt \
    --out DENTIST_result

In [41]:
wc -l ~/RSS_QC/DENTIST/DENTIST_result.DENTIST.short.txt

63 /home/hs3393/RSS_QC/DENTIST/DENTIST_result.DENTIST.short.txt


The number of rows in `*.short.txt` is the number of outliers identified in this region that should be removed.

So for the compiled DENTIST version, the **number of outliers is 63**.

### 3. DENTIST -- Rcpp version

In [45]:
dentist_result = dentist(zScore = sumstat$z, LD = LD, nSample = 1153)

In [48]:
dentist_result %>% filter(is_problematic > 0) %>% nrow()

So for the DENTIST Rcpp version, the **number of outliers is 1053**. The results look quite different, so let's take a look at the LD.
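Beyond comparing the two counts, it can help to check how much the two sets of flagged variants overlap. The sketch below assumes that `*.short.txt` holds one variant ID per line and uses made-up IDs; `read_flagged_ids` and `compare_flagged` are hypothetical helpers, not part of DENTIST or pecotmr.

```r
# Hypothetical helper: read variant IDs, one per line, skipping blank lines.
read_flagged_ids <- function(path) {
  ids <- trimws(readLines(path, warn = FALSE))
  unique(ids[nzchar(ids)])
}

# Summarize the agreement between two sets of flagged variant IDs.
compare_flagged <- function(ids_a, ids_b) {
  shared <- intersect(ids_a, ids_b)
  either <- union(ids_a, ids_b)
  list(n_a = length(ids_a),
       n_b = length(ids_b),
       n_shared = length(shared),
       jaccard = if (length(either) > 0) length(shared) / length(either) else NA_real_)
}

# Toy stand-in for the binary's output file (IDs are made up).
short_file <- tempfile(fileext = ".DENTIST.short.txt")
writeLines(c("rs1", "rs2", "rs3"), short_file)
ids_binary <- read_flagged_ids(short_file)

# Toy stand-in for the IDs flagged by the Rcpp version.
ids_rcpp <- c("rs2", "rs3", "rs4", "rs5")

overlap <- compare_flagged(ids_binary, ids_rcpp)
print(overlap)
```

A small shared set would point to a systematic difference between the implementations, whereas near-complete containment of one set in the other would suggest a difference in thresholds or inputs.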

In [50]:
library(readr)
library(dplyr, warn.conflicts = FALSE)
library(data.table)
library(RcppArmadillo)
library(Rcpp)
# Load the R wrappers and compile the C++ sources from a local pecotmr checkout
source("/home/rd2972/software/pecotmr/R/run_dentist.R")
source("/home/rd2972/software/pecotmr/R/RcppExports.R")
sourceCpp("/home/rd2972/software/pecotmr/src/dentist.cpp")
sourceCpp("/home/rd2972/software/pecotmr/src/RcppExports.cpp")
# Read the LD matrix dumped from the DENTIST run
dentist_LD_result = fread("/home/rd2972/private_data/20240300_rss_qc_imputation/DENTIST/output_LD.txt")

“No Rcpp::export attributes or RCPP_MODULE declarations found in source”
“Detected 4 column names but the data has 2061 columns (i.e. invalid file). Added 2057 extra default column names at the end.”
“Stopped early on line 2064. Expected 2061 fields but found 0. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<LD_it for iteration :>>”


In [52]:
dim(dentist_LD_result)
head(dentist_LD_result)

| LD_it | for | iteration | : | V5 | V6 | V7 | V8 | V9 | V10 | ⋯ | V2052 | V2053 | V2054 | V2055 | V2056 | V2057 | V2058 | V2059 | V2060 | V2061 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.98501 | 0.99703 | -0.16597 | 0.98062 | 0.95174 | 0.30599 | 0.19635 | 0.24287 | 0.19603 | 0.20005 | ⋯ | 0.01595 | 0.010116 | -0.0094939 | -0.0014634 | -0.010461 | -0.010461 | -0.024245 | -0.005539 | -0.041944 | -0.043 |
| 0.97722 | 0.98368 | -0.16553 | 0.99097 | 0.96237 | 0.30933 | 0.18212 | 0.24568 | 0.19891 | 0.1849 | ⋯ | 0.025213 | 0.019561 | -0.013084 | 0.0043563 | -0.0073216 | -0.0073216 | -0.022075 | -0.011471 | -0.039437 | -0.040075 |
| 0.94805 | 0.94244 | -0.078334 | 0.96493 | 0.99127 | 0.29998 | 0.17464 | 0.23784 | 0.19103 | 0.17767 | ⋯ | 0.014289 | 0.0080908 | -0.016947 | 9.1581e-05 | -0.0035575 | -0.0035575 | -0.027552 | -0.00036976 | -0.045728 | -0.047427 |
| 0.41409 | 0.40288 | -0.056151 | 0.41477 | 0.39906 | -0.034997 | -0.049078 | -0.025289 | -0.032538 | -0.046423 | ⋯ | 0.0057649 | -0.00058517 | -0.040529 | -0.0012314 | -0.04817 | -0.04817 | -0.03419 | 0.018159 | -0.046387 | -0.032131 |
| 0.58251 | 0.57434 | -0.18227 | 0.58924 | 0.60417 | 0.17331 | 0.068187 | 0.16826 | 0.084623 | 0.077967 | ⋯ | -0.013174 | -0.024045 | 0.024705 | -0.0062777 | 0.0038312 | 0.0038312 | -0.047474 | 0.040008 | -0.034784 | -0.031634 |
| 0.28936 | 0.28088 | -0.048627 | 0.29017 | 0.27282 | 0.90044 | 0.70841 | -0.039013 | -7.5352e-05 | 0.72434 | ⋯ | -0.0075276 | -0.012857 | -0.021822 | -0.0032347 | -0.032883 | -0.032883 | -0.02334 | 0.041117 | -0.016986 | -0.014725 |


Ignore the column names. The dimensions of the LD matrix are not correct yet, and the diagonal elements are not equal to 1.
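These properties can be checked programmatically. The sketch below is a minimal sanity check on toy matrices; `check_ld_matrix` is a hypothetical helper and the tolerance is an arbitrary choice.

```r
# Hypothetical helper: basic sanity checks for an LD (correlation) matrix.
check_ld_matrix <- function(LD, n_variants = NULL, tol = 1e-6) {
  LD <- as.matrix(LD)
  is_square <- nrow(LD) == ncol(LD)
  list(
    is_square = is_square,
    # NA when the expected number of variants is not supplied
    matches_variants = if (is.null(n_variants)) NA else is_square && nrow(LD) == n_variants,
    unit_diagonal = is_square && all(abs(diag(LD) - 1) < tol),
    symmetric = is_square && isSymmetric(unname(LD), tol = tol)
  )
}

set.seed(7)
# A valid toy LD matrix: correlations between 5 simulated variables.
good_ld <- cor(matrix(rnorm(50), nrow = 10, ncol = 5))
# A malformed toy matrix: rows dropped, so it is no longer square.
bad_ld <- good_ld[1:4, ]

good_check <- check_ld_matrix(good_ld, n_variants = 5)
bad_check <- check_ld_matrix(bad_ld, n_variants = 5)
print(good_check)
print(bad_check)
```

Applied to the table read above, a check like this would flag both problems at once: the row and column counts differ, and the leading entries of each row are not 1.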