# This is an example to test whether our padding algorithm works

## Overview: 

Padding solves the problem that the finemapping regions of GWAS and xqtl are not the same. For GWAS the finemapping regions are usually LD blocks, whereas for xqtl they are defined by TAD, TADB, cis window, extended cis window, etc. Because LD blocks are considered independent of each other, we can paste the results (lbf matrices) of multiple LD blocks together to match the region of our xqtl finemapping. Columns (variants) that are not shared between the qtl and GWAS data are removed.

To build the new, larger GWAS lbf matrix, our strategy is as follows: if the pasting involves two LD blocks, for example, we fill the upper-left part with the lbf matrix from LD1 and the lower-right part with the lbf matrix from LD2, and fill the rest with 0.
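The block-diagonal layout described above can be sketched on toy data. This is a minimal illustration, not the notebook's actual code: the helper `pad_lbf_blocks` and the toy matrices are hypothetical, and it assumes the LD blocks do not share any variant names.

```r
# Toy lbf matrices for two LD blocks (rows = single effects, columns = variants).
lbf_ld1 <- matrix(c(1.2, 0.3, 0.1, 2.0), nrow = 2,
                  dimnames = list(NULL, c("chr1_100_A_G", "chr1_200_C_T")))
lbf_ld2 <- matrix(c(0.5, 0.7, 1.1, 0.2, 0.9, 0.4), nrow = 2,
                  dimnames = list(NULL, c("chr1_300_G_A", "chr1_400_T_C", "chr1_500_A_C")))

# Hypothetical helper: place each block on the diagonal and fill the rest with 0.
pad_lbf_blocks <- function(blocks) {
  n_rows <- sum(vapply(blocks, nrow, integer(1)))
  variants <- unlist(lapply(blocks, colnames))
  stopifnot(!anyDuplicated(variants))  # assumption: no variant appears in two blocks
  padded <- matrix(0, nrow = n_rows, ncol = length(variants),
                   dimnames = list(NULL, variants))
  row_offset <- 0L
  col_offset <- 0L
  for (block in blocks) {
    rows <- row_offset + seq_len(nrow(block))
    cols <- col_offset + seq_len(ncol(block))
    padded[rows, cols] <- block
    row_offset <- row_offset + nrow(block)
    col_offset <- col_offset + ncol(block)
  }
  padded
}

padded <- pad_lbf_blocks(list(lbf_ld1, lbf_ld2))
```

Here `padded` has one row per single effect across both blocks and one column per variant. The off-diagonal entries are 0, which corresponds to a Bayes factor of 1 (no evidence either way) for variants outside the block.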

After building the new lbf matrices for both qtl and GWAS, we need to re-compute the alpha matrix and PIP. The eqtl credible sets may require running finemapping again. For now we choose a simple strategy: remove a variant from any credible set if it is not shared by the two (or more) datasets.
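For reference, the re-computation used in Step 4 of this notebook can be written as follows, where $l$ indexes the rows (single effects) of the lbf matrix, $j$ indexes the variants, and $\pi_j$ is the prior weight of variant $j$ (uniform by default in the code):

$$
\alpha_{lj} = \frac{\pi_j \exp(\mathrm{lbf}_{lj})}{\sum_{k} \pi_k \exp(\mathrm{lbf}_{lk})},
\qquad
\mathrm{PIP}_j = 1 - \prod_{l} \left(1 - \alpha_{lj}\right)
$$

In the code the row maximum is subtracted from the lbf values before exponentiating. This cancels in the ratio and only serves numerical stability.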

## Input 1: GWAS finemapping result folder

Here I used our ADHD GWAS finemapping result: `/home/hs3393/ADHD/ADHD_finemap`

## Input 2: qtl finemapping result 

It is better to use `ls /mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/cache/*rds` as input, so that the number of jobs equals the number of genes.

This is the eqtl finemapping result produced by Hao for AD.

## Step details:

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# list all GWAS finemapping result files; chr, start, end and the file paths are extracted from the file names

gwas_finemap_result = list.files("~/ADHD/ADHD_finemap", full.names=T, pattern = "\\.rds$")

gwas_file_tb = tibble(file_path = gwas_finemap_result)

# use regex to extract the chromosome, start and end of each LD block from the file name
match_pattern = function(filename){
    pattern <- "chr[0-9XY]+_(\\d+_\\d+)"
    result = regmatches(filename, regexpr(pattern, filename))
    return(result)
}

# map_chr returns a character vector (it errors if a file name does not match the pattern)
LD_block_position = gwas_file_tb %>% mutate(chr_pos = map_chr(file_path, match_pattern)) %>% 
  separate(chr_pos, into = c("chr", "start", "end"), sep = "_") %>% relocate(file_path, .after = end) %>%
    mutate(start = as.numeric(start), end = as.numeric(end)) %>% arrange(chr, start)

In [3]:
head(LD_block_position)

chr,start,end,file_path
<chr>,<dbl>,<dbl>,<chr>
chr1,16103,2888443,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr1.chr1_16103_2888443.unisusie_rss.fit.rds
chr1,2888443,4320284,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr1.chr1_2888443_4320284.unisusie_rss.fit.rds
chr1,4320284,5853833,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr1.chr1_4320284_5853833.unisusie_rss.fit.rds
chr1,5853833,7110219,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr1.chr1_5853833_7110219.unisusie_rss.fit.rds
chr1,7110219,9473386,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr1.chr1_7110219_9473386.unisusie_rss.fit.rds
chr1,9473386,11328222,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr1.chr1_9473386_11328222.unisusie_rss.fit.rds
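
The file-name parsing above can be checked in isolation with base R. The sketch below is only an illustration: `parse_block` is a hypothetical helper, and the file name is the first one from the table above with its directory stripped.

```r
# Hypothetical helper: extract chr, start and end of an LD block from a file name.
parse_block <- function(path) {
  # regexpr returns the first match; "chr1.chr1_..." only matches at the second "chr1"
  m <- regmatches(path, regexpr("chr[0-9XY]+_[0-9]+_[0-9]+", path))
  if (length(m) == 0) return(NULL)  # no LD block coordinates in the name
  parts <- strsplit(m, "_", fixed = TRUE)[[1]]
  data.frame(chr = parts[1],
             start = as.numeric(parts[2]),
             end = as.numeric(parts[3]))
}

example_block <- parse_block("ADHD_sumstat_hg38_qc.chr1.chr1_16103_2888443.unisusie_rss.fit.rds")
no_block <- parse_block("no_match.rds")
```

Returning `NULL` for names without coordinates is one possible design choice. It lets unexpected files in the folder be skipped explicitly rather than silently producing malformed rows.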


## Step 2: input - qtl finemapping result files (rds format)

In [4]:
qtl_list = list.files("/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/cache", full.names=T, pattern = "\\.rds$")
qtl_file = readRDS(qtl_list[100])

In [5]:
# variant names are assumed to look like "chr7:76680351_T_C" and to be sorted by position,
# so the first and last variants give the boundaries of the qtl region
qtl_variants = qtl_file$dlpfc_eqtl$variable_name
qtl_chr = regmatches(qtl_variants[1], regexpr("chr[0-9XY]+", qtl_variants[1]))
qtl_start = as.numeric(sub(".*:(\\d+)_.*", "\\1", qtl_variants[1]))
qtl_end = as.numeric(sub(".*:(\\d+)_.*", "\\1", qtl_variants[length(qtl_variants)]))

In [6]:
qtl_chr
qtl_start 
qtl_end 

In [8]:
# retrieve the GWAS LD blocks that overlap the qtl region
related_LD = LD_block_position %>% filter(chr == qtl_chr) %>% filter((start <= qtl_start & end >= qtl_start) |
                                         (start >= qtl_start & end <= qtl_end) | 
                                         (start <= qtl_end & end >= qtl_end))

In [9]:
related_LD

chr,start,end,file_path
<chr>,<dbl>,<dbl>,<chr>
chr7,73724576,77106024,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr7.chr7_73724576_77106024.unisusie_rss.fit.rds
chr7,77106024,78547096,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr7.chr7_77106024_78547096.unisusie_rss.fit.rds
chr7,78547096,80918333,/home/hs3393/ADHD/ADHD_finemap/ADHD_sumstat_hg38_qc.chr7.chr7_78547096_80918333.unisusie_rss.fit.rds
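
The three-way condition used to select related LD blocks is equivalent to a single interval-overlap test, `start <= qtl_end & end >= qtl_start`, as long as `start <= end` for both the block and the qtl region. A small sketch on toy coordinates (the data frame and the helper `overlapping_blocks` are hypothetical):

```r
# Toy LD blocks on one chromosome.
blocks <- data.frame(chr = rep("chr7", 4),
                     start = c(100, 200, 300, 400),
                     end = c(200, 300, 400, 500))

# Hypothetical helper: keep blocks whose interval overlaps [region_start, region_end].
overlapping_blocks <- function(blocks, region_chr, region_start, region_end) {
  keep <- blocks$chr == region_chr &
    blocks$start <= region_end &
    blocks$end >= region_start
  blocks[keep, , drop = FALSE]
}

hits <- overlapping_blocks(blocks, "chr7", 250, 350)
```

With a region of 250 to 350, only the blocks starting at 200 and 300 are kept. The single condition also covers a block that contains the whole region.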


## Step 3: cook up new lbf matrix

In [10]:
# extract those related GWAS finemapping result lbf matrix to form the larger one
cnt = 1
variants = c()
lbf_mtx = list()
for (file in related_LD$file_path){
    rds = readRDS(file)
    variants = c(variants, rds$variants)
    lbf_mtx[[cnt]] = as.data.frame(rds$lbf_variable)
    colnames(lbf_mtx[[cnt]]) = rds$variants
    cnt = cnt + 1
}

# after combining the matrices, fill the NAs with 0 to form the whole matrix
lbf_whole_mtx = bind_rows(lbf_mtx) %>% replace(is.na(.), 0)

# get the shared variants between gwas and qtl

# caveat: newer finemapping output may not have the $dlpfc_eqtl level, and variant names come in
# two formats (chr:9999 or chr_9999), so here we change ":" to "_"
# if the variant names are already uniform, the str_replace_all call can be removed
qtl_variants_std = str_replace_all(unlist(qtl_file$dlpfc_eqtl$variable_name), ":", "_")
shared_variant = intersect(qtl_variants_std, colnames(lbf_whole_mtx))

# only keep the columns (SNPs) shared by the two datasets
GWAS_lbf_matrix = lbf_whole_mtx[, shared_variant]

# again, harmonize the variant names on the qtl side
colnames(qtl_file$dlpfc_eqtl$lbf_variable) = qtl_variants_std

# the susie output records cs members by index, so record the indices of the removed variants
rm_index = which(!(colnames(qtl_file$dlpfc_eqtl$lbf_variable) %in% shared_variant))

# also, for qtl data, only keep variants that are shared
qtl_lbf_mtx = qtl_file$dlpfc_eqtl$lbf_variable[,shared_variant]

## Step 4: based on lbf matrix to get new alpha, PIP and cs

In [12]:
# convert lbf to alpha
lbf_to_alpha_vector = function(lbf, prior_weights = NULL) {
  if (is.null(prior_weights)) prior_weights = 1/length(lbf)
  maxlbf = max(lbf)
  # w is proportional to BF, but subtract max for numerical stability.
  w = exp(lbf - maxlbf)
  # Posterior prob for each SNP.
  w_weighted = w * prior_weights
  weighted_sum_w = sum(w_weighted)
  alpha = w_weighted / weighted_sum_w
  return(alpha)
}

lbf_to_alpha = function(lbf) t(apply(lbf, 1, lbf_to_alpha_vector))

# convert lbf to pip
lbf_to_pip = function(lbf) {
    alpha = lbf_to_alpha(lbf)
    return(as.vector(1 - apply(1 - alpha,2,prod)))
}

# use new lbf matrix to compute alpha and pip
GWAS_alpha = lbf_to_alpha(GWAS_lbf_matrix)
GWAS_pip = lbf_to_pip(GWAS_lbf_matrix)

qtl_alpha = lbf_to_alpha(qtl_lbf_mtx)
qtl_pip = lbf_to_pip(qtl_lbf_mtx)
  

# for now, we remove a variant from a cs if it is not shared by the two traits
## note: this is a rough strategy and may change! the remaining indices still refer to the
## original (unfiltered) qtl variant order, not to the columns of qtl_lbf_mtx
cs_number = length(qtl_file$dlpfc_eqtl$sets$cs)
new_cs = list()
if(cs_number == 0){
    new_cs = NA
}else{
    for (i in (1:cs_number)){
     new_cs[[i]] = setdiff(qtl_file$dlpfc_eqtl$sets$cs[[i]], rm_index)
    }
}

# the output can be: GWAS+ qtl lbf matrix; alpha; pip
# and qtl new cs
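
Because credible sets are stored as indices, dropping variants shifts the positions of the remaining ones. One possible refinement, sketched here on toy data, is to map each credible set onto the filtered variant order by name. The helper `remap_cs` and all names below are hypothetical and not part of the pipeline.

```r
# Toy variant names: qtl uses "chr:pos", GWAS uses "chr_pos".
qtl_names <- c("chr7:100_A_G", "chr7:200_C_T", "chr7:300_G_A", "chr7:400_T_C")
gwas_names <- c("chr7_200_C_T", "chr7_300_G_A", "chr7_400_T_C", "chr7_500_A_C")

# Harmonize the separator, then keep the variants present in both datasets.
qtl_names_std <- gsub(":", "_", qtl_names, fixed = TRUE)
shared <- intersect(qtl_names_std, gwas_names)

# Toy credible sets, given as indices into the original qtl variant order.
cs_old <- list(L1 = c(1L, 2L), L2 = 4L)

# Hypothetical helper: drop unshared variants and re-index into the filtered order.
remap_cs <- function(cs, old_names, kept_names) {
  lapply(cs, function(idx) {
    new_idx <- match(old_names[idx], kept_names)
    new_idx[!is.na(new_idx)]
  })
}

cs_new <- remap_cs(cs_old, qtl_names_std, shared)
```

In this toy case the first variant is not shared, so `L1` keeps only its second member, which becomes column 1 of the filtered matrix, and `L2` moves from index 4 to index 3.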

In [13]:
str(qtl_pip)

 num [1:10384] 0.000905 0.00082 0.000818 0.000915 0.000815 ...


In [14]:
str(GWAS_pip)

 num [1:10384] 0.00289 0.00289 0.00289 0.00289 0.00289 ...


Now the two datasets share the same variants and the same number of variants, so we can run susie_coloc!

In [16]:
str(GWAS_alpha)

 num [1:30, 1:10384] 9.63e-05 9.63e-05 9.63e-05 9.63e-05 9.63e-05 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:10384] "chr7_76680351_T_C" "chr7_76680425_A_C" "chr7_76681272_G_GA" "chr7_76681467_T_C" ...


In [17]:
str(qtl_alpha)

 num [1:10, 1:10384] 9.16e-05 9.16e-05 9.13e-05 9.07e-05 9.02e-05 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:10384] "chr7_76680351_T_C" "chr7_76680425_A_C" "chr7_76681272_G_GA" "chr7_76681467_T_C" ...


The GWAS alpha matrix has 30 rows, which matches our design (the rows from the three related LD blocks are stacked), and the two matrices have the same number of columns.
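
As a quick sanity check of the lbf to alpha to PIP conversion on a padded matrix, the toy sketch below uses a simplified re-implementation with a uniform prior (which cancels in the normalization). The function `lbf_to_alpha_row` and the toy values are assumptions made for illustration.

```r
# Simplified row-wise conversion with a uniform prior (the prior cancels out).
lbf_to_alpha_row <- function(lbf) {
  w <- exp(lbf - max(lbf))  # subtract the max for numerical stability
  w / sum(w)
}

# Toy padded lbf matrix: row 1 comes from block 1, row 2 from block 2, zeros elsewhere.
padded_lbf <- rbind(c(3, 1, 0, 0),
                    c(0, 0, 2, 4))

alpha <- t(apply(padded_lbf, 1, lbf_to_alpha_row))
pip <- 1 - apply(1 - alpha, 2, prod)
```

Each row of `alpha` sums to 1. Note that the zero-padded entries still receive a small non-zero alpha, because an lbf of 0 corresponds to a Bayes factor of 1 rather than to zero probability.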

In [15]:
str(new_cs)

 logi NA
