




# fastENLOC from SuSiE objects

[fastENLOC](https://github.com/xqwen/fastenloc) enables integrative genetic association analysis of molecular QTL data and GWAS data. 

### NOTE:
Before running this pipeline, please upgrade SoS to version 0.24.0 or higher; older versions cannot call our new container.

## Overview

The goal of this module is to perform fastENLOC analysis from SuSiE objects in two steps:
1. Convert SuSiE eQTL objects to the DAP-G VCF format.
2. Run fastENLOC on the converted output from step 1.


## Input

1. QTL susie table:
    - This table has two columns, `molecular_trait_id` and `susie_file`: the target gene and the path to the corresponding SuSiE output `.rds` file, respectively.
2. GWAS susie table:
    - This table has two columns, `ld_block` and `susie_object_file`: the LD block and the path to the corresponding SuSiE output `.rds` file, respectively.


### NOTE 2:
Preparing the inputs:
Ideally, these tables should be generated upstream, as the final step of fine-mapping. The examples below show how to extract the gene name (or LD block name) from the file names of the fine-mapped eQTL and GWAS results. The file-naming scheme can differ between QTL data sets, so adjust the parsing accordingly.

- 1. QTL susie table:
```
import os
import pandas as pd

path = "/mnt/vast/hpc/csg/rf2872/Work/INTACT/DLPFC/padding/output/eQTL"

filenames = []
filepaths = []

for file in os.listdir(path):
    if file.endswith(".rds"):
        middle_name = file.split('.')[2]  # third dot-separated field of the file name; adjust the index to your naming scheme
        filenames.append(middle_name)
        full_path = os.path.join(path, file)
        filepaths.append(full_path)

df = pd.DataFrame({
    'molecular_trait_id': filenames,
    'susie_file': filepaths
})

# Save the DataFrame as a tab-delimited text file
df.to_csv('eqtl_susie_table.txt', sep='\t', index=False)
```


- 2. AD GWAS susie table:

```
import os
import pandas as pd

path = "/mnt/vast/hpc/csg/rf2872/Work/INTACT/DLPFC/padding/output/GWAS"

filenames = []
filepaths = []

for file in os.listdir(path):
    if file.endswith(".rds"):
        middle_name = file.split('.')[2]  # third dot-separated field of the file name; adjust the index to your naming scheme
        filenames.append(middle_name)
        full_path = os.path.join(path, file)
        filepaths.append(full_path)

df = pd.DataFrame({
    'ld_block': filenames,
    'susie_object_file': filepaths
})

df.to_csv('ADGWAS_susie_table.txt', sep='\t', index=False)
```
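
The two snippets above differ only in the directory and the column names, so they can be folded into one helper. The following is a minimal sketch under stated assumptions: `build_susie_table` and its `name_field` argument are hypothetical names introduced here, and the position of the gene or LD block name within the file name depends on how your fine-mapping outputs are named.

```python
from pathlib import Path

import pandas as pd


def build_susie_table(directory, id_column, file_column, name_field=2):
    """List the .rds files in `directory` and pair each with an ID parsed from its name.

    `name_field` is the 0-based dot-separated field of the file name that holds
    the ID (hypothetical default of 2, matching the snippets above).
    """
    rows = []
    for rds in sorted(Path(directory).glob("*.rds")):
        fields = rds.name.split(".")
        if name_field >= len(fields):
            continue  # file name has too few fields to contain an ID; skip it
        rows.append({id_column: fields[name_field], file_column: str(rds)})
    return pd.DataFrame(rows, columns=[id_column, file_column])


def write_susie_table(table, out_path):
    """Save the table as tab-delimited text, as the snippets above do."""
    table.to_csv(out_path, sep="\t", index=False)
```

For example, `build_susie_table(path, "ld_block", "susie_object_file")` would produce the GWAS table, and `write_susie_table` would save it in the same tab-delimited layout.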

## Output
1. Re-formatted eQTL fine-mapped results from step `susie_to_dapg`
2. Re-formatted GWAS fine-mapped results from step `fastenloc`
3. fastENLOC results computed from the eQTL and GWAS fine-mapped results.
    

### Explanation of the fastENLOC results

1) Enrichment analysis result `prefix.enloc.enrich.out`: estimated enrichment parameters and standard errors.

2) Signal-level colocalization result `prefix.enloc.sig.out`: the main output from the colocalization analysis, with the following format
- column 1: signal cluster name (from eQTL analysis)
- column 2: number of member SNPs
- column 3: cluster PIP of eQTLs
- column 4: cluster PIP of GWAS hits (without eQTL prior)
- column 5: cluster PIP of GWAS hits (with eQTL prior)
- column 6: regional colocalization probability (RCP)

3) SNP-level colocalization result `prefix.enloc.snp.out`: SNP-level colocalization output, with the following format
- column 1: signal cluster name
- column 2: SNP name
- column 3: SNP-level PIP of eQTLs
- column 4: SNP-level PIP of GWAS (without eQTL prior)
- column 5: SNP-level PIP of GWAS (with eQTL prior)
- column 6: SNP-level colocalization probability

4) To obtain a list of colocalization signals sorted by RCP (column 6), run

  ```sort -grk6 prefix.enloc.sig.out```
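
The same ranking can be done in Python when the results are post-processed in a notebook. This is a minimal sketch, assuming the signal-level file is whitespace-delimited, has no header line, and has exactly the six columns listed above; the column names in `SIG_COLUMNS` are labels chosen here, not names used by fastENLOC, and the two rows are made up for illustration.

```python
import io

import pandas as pd

# Labels chosen here for the six columns described above (not fastENLOC names).
SIG_COLUMNS = [
    "signal_cluster",
    "n_snps",
    "eqtl_cluster_pip",
    "gwas_cluster_pip",
    "gwas_cluster_pip_with_eqtl_prior",
    "rcp",
]


def load_sig_out(source):
    """Read a header-less signal-level file and sort it by RCP, largest first."""
    table = pd.read_csv(source, sep=r"\s+", header=None, names=SIG_COLUMNS)
    return table.sort_values("rcp", ascending=False).reset_index(drop=True)


# Made-up rows for illustration only; these are not real fastENLOC results.
example = io.StringIO(
    "GENE_A:1 12 9.5e-01 1.0e-03 2.0e-02 1.5e-02\n"
    "GENE_B:1 3 9.9e-01 4.0e-01 8.0e-01 7.5e-01\n"
)
ranked = load_sig_out(example)
print(ranked[["signal_cluster", "rcp"]])
```

If your fastENLOC version writes a header line or additional columns, adjust `header` and `SIG_COLUMNS` accordingly.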

# Example
We now run an example using the VCF file generated from a sample of SuSiE eQTL results.

In [None]:
cd /mnt/vast/hpc/csg/rf2872/Work/INTACT/DLPFC

sos run pipeline/fastenloc.ipynb susie_to_dapg \
    --susie-table eqtl_susie_table.txt  \
    --tissue DLPFC  -J 50 -c ~/test/csg.yml -q csg --job_size 1 -s build

In [None]:
sos run pipeline/fastenloc.ipynb fastenloc  \
    --file-table ADGWAS_susie_table.txt   \
    --qtl_vcf output/fastenloc/eQTL.susie_to_DAPG.vcf.gz \
    --out-pre eQTL_ADGWAS \
    --tissue DLPFC \
    --container /mnt/mfs/hgrcgrid/homes/zq2209/output8/fastenloc/fastenloc.sif \
   -J 50 -c ~/test/csg.yml -q csg --job_size 1 -s build --cwd  output/fastenloc --mem 80G

In [1]:
[global]
# Workdir
parameter: cwd = path("output")
# susie_table is the table of eQTL fine-mapped results, which has two columns: gene and susie_file
parameter: susie_table = ""
# This vcf is derived from the conversion of the susie rds for each gene, with the relevant information noted in the INFO column.
#parameter: out_vcf = ""
# file_table is the table of GWAS fine-mapped results, which has two columns: LD block and susie_object_file
parameter: file_table = ""
# out_file is a temporary file in the environment
parameter: out_file = ""
# the prefix of fastENLOC output 
parameter: out_pre = ""
# the zipped file of out_vcf, which is derived from the conversion of the susie rds for each gene
parameter: qtl_vcf = ""
# dataset 
parameter: tissue = ''
# QTL data type
parameter: QTL = 'eQTL'
# GWAS data type
parameter: GWAS = 'GWAS'
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
parameter: job_size = 1
parameter: walltime = "5h"
parameter: mem = "8G"
parameter: numThreads = 1

qtl_vcf = file_target(f"{cwd:a}/{QTL}.susie_to_DAPG.vcf.gz")
import os
if not os.path.exists(f'{cwd}/cache/'):
    os.makedirs(f'{cwd}/cache/')
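
The `entrypoint` parameter above derives a micromamba environment name from the container file name. The sketch below replays that expression in plain Python so its behavior is easier to see; `derive_entrypoint` is a hypothetical helper written for illustration, and `/path/to/fastenloc.sif` is a placeholder path.

```python
import re


def derive_entrypoint(container):
    """Mirror the `entrypoint` expression in the [global] cell.

    The image file name is taken from the path, a known suffix is stripped,
    and the remainder is used as the micromamba environment name.
    """
    if not container:
        return ""
    image_name = container.split("/")[-1]
    env_name = re.sub(r"(_apptainer:latest|_docker:latest|\.sif)$", "", image_name)
    return 'micromamba run -a "" -n ' + env_name


print(derive_entrypoint("/path/to/fastenloc.sif"))
# micromamba run -a "" -n fastenloc
```

An empty `container` yields an empty entrypoint, so the command then runs without the micromamba wrapper.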

# Conversion of SuSiE eQTL Objects to DAP-G VCF Format
To run fastENLOC, we first need to convert the eQTL results for each gene in a given tissue into a VCF-like file with the relevant information in the INFO column. The input is a table listing each gene and its corresponding eQTL SuSiE file.

It should be formatted like the file below.

In [17]:
head -n5 eqtl_susie_table_head300.txt

molecular_trait_id	susie_file
ENSG00000000419	/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/cache/demo.ENSG00000000419.unisusie.fit.rds
ENSG00000000457	/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/cache/demo.ENSG00000000457.unisusie.fit.rds
ENSG00000000938	/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/cache/demo.ENSG00000000938.unisusie.fit.rds
ENSG00000000971	/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad/cache/demo.ENSG00000000971.unisusie.fit.rds
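
The `susie_to_dapg_1` step below splits this table into chunks of 100 rows with `df.groupby(df.index // 100)`, so that each chunk becomes one task. The following toy sketch shows how that grouping behaves; `split_into_chunks` and the 250-row table are made up for illustration.

```python
import pandas as pd


def split_into_chunks(table, chunk_size=100):
    """Group consecutive rows into chunks of at most `chunk_size` rows."""
    # Integer division of a 0..n-1 index assigns rows 0-99 to chunk 0, 100-199 to chunk 1, ...
    table = table.reset_index(drop=True)
    return [chunk for _, chunk in table.groupby(table.index // chunk_size)]


# Toy table with 250 rows (illustrative gene names and paths).
toy = pd.DataFrame(
    {
        "molecular_trait_id": [f"gene{i}" for i in range(250)],
        "susie_file": [f"gene{i}.rds" for i in range(250)],
    }
)
chunks = split_into_chunks(toy)
print([len(chunk) for chunk in chunks])  # [100, 100, 50]
```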


In [None]:
[susie_to_dapg_1]
import pandas as pd
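df = pd.read_csv(susie_table, sep="\t")  # the table is tab-delimited
file_paths = []
# split the table into chunks of 100 rows; each chunk becomes one task
for i, group in df.groupby(df.index // 100):
    output_file_path = f'{cwd}/cache/{QTL}.chunk_{i}.csv'
    # keep tabs as the delimiter: the R code below reads each chunk with sep = "\t"
    group.to_csv(output_file_path, sep="\t", index=False)
    file_paths.append(output_file_path)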
input: file_paths, group_by = 1
output: f'{cwd}/fastenloc/cache/{_input:bn}.vcf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    library(susieR)
    library(stringr)
    
    susie_tbl = read.csv("${_input}", sep = "\t")
    susie_files = susie_tbl$susie_file
    genes = susie_tbl$molecular_trait_id
    tissue = "${tissue}"

    vcf_out = data.frame(chr=NULL, pos=NULL, var_id=NULL, ref = NULL, alt = NULL, info=NULL)
    for(i in seq(1, length(genes))) {
      gene = genes[i]
      ssie_obj = readRDS(susie_files[i])
      # now we get the credible set level values
      # get the sum of the PIPs within each credible set
      sums_cs_pip = lapply(ssie_obj$sets$cs, function(set) sum(ssie_obj$pip[set]))
      # get length of each of the credible sets
      lengths_cs = lengths(ssie_obj$sets$cs)
      for(i_var in seq(1, length(ssie_obj$variable_name))) {
        #var_id = str_replace_all(ssie_obj$variable_name[i_var], ":", "_")
        var_id = ssie_obj$variable_name[i_var]
        chr = strsplit(var_id, "[:|_]")[[1]][1]
        pos = strsplit(var_id, "[:|_]")[[1]][2]
        ref = strsplit(var_id, "[:|_]")[[1]][3]
        alt = strsplit(var_id, "[:|_]")[[1]][4]
        pip = ssie_obj$pip[i_var]
        cs_id = -1
        cs_ids = -1
        if (i_var %in% unlist(ssie_obj$sets$cs)) {  #need to clean up ``
          cs_ids = names(ssie_obj$sets$cs)[sapply(ssie_obj$sets$cs, function(set) i_var %in% set)]
          cs_ids = as.integer(str_remove(cs_ids, "L"))
        }
        # ignore if not in a credible set
        if (any(cs_ids == -1)) {
          #print("next")
          next()
        }
        # ignore if pip is below a threshold
        if(pip < 1e-04) {
          next()
        }
        for (cs_id in cs_ids) {
            sums_values <- format(sums_cs_pip[[paste0("L", cs_id)]], scientific = T)
            lengths_values <- format(lengths_cs[[paste0("L", cs_id)]], scientific = T)
            info = paste0(gene, ":", cs_id, "@", tissue, "=", format(pip, scientific = T), "[", 
                          sums_values, ":",  lengths_cs[[paste0("L", cs_id)]], "]")
            df = data.frame(chr=chr, pos=pos, var_id=var_id, ref = ref, alt = alt, info=info)
            vcf_out <- rbind(vcf_out, df)
        }
      }
    }
    write.table(vcf_out, "${_output}", sep ="\t", quote = F, row.names = F, col.names = F)

In [None]:
[susie_to_dapg_2]
input:  group_by = 'all'
output: f'{cwd:a}/fastenloc/{QTL}.susie_to_DAPG.vcf_tmp'
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    cd ${cwd}/fastenloc/cache
    #module load BCFTOOLS
    #bcftools merge *.vcf -O v -o merged.vcf
    # in case we don't want to rebuild the container to add BCFTOOLS, use `cat` here, since the VCF files have no header
    # write with `>` rather than `>>` so that a rerun does not append duplicate records
    cat ${QTL}.*.vcf > ${_output}
    #cd ../ && rm -r cache/


In [None]:
[susie_to_dapg_3]
input: group_by = 1
output: prior_data = f'{cwd:a}/fastenloc/{QTL}.susie_to_DAPG.vcf.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
python3: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    # collect the INFO entries of every variant, then write one line per variant
    info_map = {}
    with open("${_input}") as f:
        for line in f:
            elems = line.rstrip("\n").split("\t")
            var_idx = elems[2]
            info_map.setdefault(var_idx, []).append(elems[-1].strip())
    with open("${_output:n}", "w") as f_out:
        for var_idx, infos in info_map.items():
            chrm, pos, ref, alt = var_idx.replace(":", "_").split("_")
            # entries from different genes or credible sets are joined with "|"
            info = "|".join(infos)
            f_out.write("\t".join([chrm, pos, var_idx, ref, alt, info]) + "\n")
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    gzip -f ${_output:n}
    rm ${_input}

**output_1**: re-formatted eQTL fine-mapped results from `susie_to_dapg`

In [19]:
zcat output/eQTL.susie_to_DAPG.vcf.gz | head -n5

chr1	169880000	chr1_169880000_A_G	A	G	ENSG00000000457:1@DLPFC=2.018637e-02:[1.02222e-02:93]
chr1	169880330	chr1_169880330_A_T	A	T	ENSG00000000457:1@DLPFC=2.018637e-02:[1.02222e-02:93]
chr1	169880762	chr1_169880762_CTT_CT	CTT	CT	ENSG00000000457:1@DLPFC=8.174311e-02:[1.02222e-02:93]
chr1	169880823	chr1_169880823_G_A	G	A	ENSG00000000457:1@DLPFC=2.018637e-02:[1.02222e-02:93]
chr1	169880877	chr1_169880877_T_C	T	C	ENSG00000000457:1@DLPFC=2.018637e-02:[1.02222e-02:93]

gzip: stdout: Broken pipe
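
To inspect the converted file programmatically, the INFO column can be split back into its parts. This is a minimal sketch, assuming each entry has the layout `gene:cs_id@tissue=pip[cs_pip_sum:cs_size]` written by `susie_to_dapg_1`, with entries for the same variant joined by `|`. The example output above also shows a `:` before the opening bracket, so the pattern accepts both forms. `parse_info` and the field names are hypothetical.

```python
import re

# One INFO entry: gene:cs_id@tissue=pip[cs_pip_sum:cs_size], with an optional ":" before "[".
INFO_PATTERN = re.compile(
    r"^(?P<gene>[^:@]+):(?P<cs>\d+)@(?P<tissue>[^=]+)="
    r"(?P<pip>[^:\[]+):?\[(?P<cs_pip>[^:\]]+):(?P<cs_size>\d+)\]$"
)


def parse_info(field):
    """Split an INFO field on "|" and parse every entry into a dictionary."""
    records = []
    for entry in field.split("|"):
        match = INFO_PATTERN.match(entry)
        if match is None:
            raise ValueError(f"unexpected INFO entry: {entry!r}")
        records.append(
            {
                "gene": match["gene"],
                "credible_set": int(match["cs"]),
                "tissue": match["tissue"],
                "pip": float(match["pip"]),
                "cs_pip_sum": float(match["cs_pip"]),
                "cs_size": int(match["cs_size"]),
            }
        )
    return records


# INFO value taken from the first row of the example output above.
print(parse_info("ENSG00000000457:1@DLPFC=2.018637e-02:[1.02222e-02:93]"))
```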


# fastENLOC from SuSiE objects

The first input for the pipeline is a table listing the `SuSiE` objects for each of the LD Blocks. This takes the form of one column with the name of the LD block and another with the path to the object file.

In [18]:
head -n5 ADGWAS_susie_table.txt

ld_block	susie_object_file
chr17_1013206_2799513	/mnt/vast/hpc/csg/xqtl_workflow_testing/susie_rss/output/ADGWAS2022.chr17.sumstat.chr17_1013206_2799513.unisusie_rss.fit.rds
chr17_10670471_12764265	/mnt/vast/hpc/csg/xqtl_workflow_testing/susie_rss/output/ADGWAS2022.chr17.sumstat.chr17_10670471_12764265.unisusie_rss.fit.rds
chr17_120360_1013206	/mnt/vast/hpc/csg/xqtl_workflow_testing/susie_rss/output/ADGWAS2022.chr17.sumstat.chr17_120360_1013206.unisusie_rss.fit.rds
chr17_12764265_13625781	/mnt/vast/hpc/csg/xqtl_workflow_testing/susie_rss/output/ADGWAS2022.chr17.sumstat.chr17_12764265_13625781.unisusie_rss.fit.rds


The first two steps (`fastenloc_1` and `fastenloc_2`) combine the SuSiE objects into a single table for use with fastENLOC.

The last step (`fastenloc_3`) then runs fastENLOC on that table.

In [2]:
[fastenloc_1]
import pandas as pd
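df = pd.read_csv(file_table, sep="\t")  # the table is tab-delimited
file_paths = []
# split the table into chunks of 100 rows; each chunk becomes one task
for i, group in df.groupby(df.index // 100):
    output_file_path = f'{cwd:a}/cache/{GWAS}.chunk_{i}.csv'
    # keep tabs as the delimiter: the R code below reads each chunk with sep = "\t"
    group.to_csv(output_file_path, sep="\t", index=False)
    file_paths.append(output_file_path)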
input: file_paths, group_by=1
output: f'{cwd:a}/cache/{_input:bn}.vcf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
  susie_tbl = read.csv('${_input}', sep = "\t")
  out_tbl = list()
  out_tbl$var = c()
  out_tbl$pip = c()
  out_tbl$set = c()
  for(idx in seq(1,nrow(susie_tbl))) {
    ld_block = susie_tbl[[1]][idx]
    filename = susie_tbl[[2]][idx]
    ssie = readRDS(filename)
    vars = ssie$variable_name
    out_tbl$var = c(out_tbl$var, vars)
    out_tbl$set = c(out_tbl$set, rep(ld_block, length(vars)))
    pip = ssie$pip
    out_tbl$pip = c(out_tbl$pip, pip)
  }
  out_tbl = as.data.frame(out_tbl)
  #out_tbl$var = paste0(out_tbl$var, "_b38")
  write.table(out_tbl, "${_output}", sep = "\t", quote = F, row.names = F, col.names = F)


In [3]:
[fastenloc_2]
input:  group_by = 'all'
output: f'{cwd:a}/{GWAS}.susie_to_DAPG.vcf.gz'
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout'
    cd ${cwd:a}/cache
    #module load BCFTOOLS
    #bcftools merge *.vcf -O v -o merged.vcf
    # in case we don't want to rebuild the container to add BCFTOOLS, use cat here, since the VCF files have no header
    # write with `>` rather than `>>` so that a rerun does not append duplicate records
    cat ${GWAS}.*.vcf > ${_output:n}
    gzip -f ${_output:n}
    #rm -r ${cwd:a}/cache


In [3]:
[fastenloc_3]
input: group_by=1
output: enrich_out = f'{cwd}/{out_pre}.enloc.enrich.out',
        gene_out = f'{cwd}/{out_pre}.enloc.gene.out',
        snp_out = f'{cwd}/{out_pre}.enloc.snp.out',
        sig_out = f'{cwd}/{out_pre}.enloc.sig.out'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand="${ }", stderr = f'{out_pre.split("/")[-1]}_enloc.stderr', stdout = f'{out_pre.split("/")[-1]}_enloc.stdout', container = container, entrypoint = entrypoint
    fastenloc -eqtl ${qtl_vcf} -gwas ${_input} -t ${tissue} -prefix ${cwd}/${out_pre} # fix it after the new version of the fastenloc container

**output_2**: re-formatted GWAS fine-mapped results from `fastenloc`

In [23]:
zcat output/GWAS.susie_to_DAPG.vcf.gz | head -n5

chr17_1013206_G_A_b38	chr17_1013206_2799513	0.000744869767935419
chr17_1013226_C_T_b38	chr17_1013206_2799513	0.000943722701734817
chr17_1013285_G_A_b38	chr17_1013206_2799513	0.00121844744384469
chr17_1013417_C_T_b38	chr17_1013206_2799513	0.00588273046469134
chr17_1013425_G_A_b38	chr17_1013206_2799513	0.000875241390920545

gzip: stdout: Broken pipe
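
A quick sanity check of this table before running fastENLOC can catch malformed rows early. The sketch below assumes the three-column, tab-delimited, header-less layout shown above (variant, LD block, PIP); `load_gwas_pips` and the column labels are hypothetical names, and the two rows are copied from the example output.

```python
import io

import pandas as pd

# Labels chosen here for the three columns shown above.
GWAS_COLUMNS = ["variant", "ld_block", "pip"]


def load_gwas_pips(source):
    """Read the header-less GWAS table and check that every PIP lies in [0, 1]."""
    table = pd.read_csv(source, sep="\t", header=None, names=GWAS_COLUMNS)
    if not table["pip"].between(0, 1).all():
        raise ValueError("found PIP values outside [0, 1]")
    return table


# First two rows of the example output above.
example = io.StringIO(
    "chr17_1013206_G_A_b38\tchr17_1013206_2799513\t0.000744869767935419\n"
    "chr17_1013226_C_T_b38\tchr17_1013206_2799513\t0.000943722701734817\n"
)
gwas = load_gwas_pips(example)
# Total PIP per LD block, a rough indicator of how much signal each block carries.
print(gwas.groupby("ld_block")["pip"].sum())
```

Passing a path ending in `.gz` instead of the in-memory example should also work, since `pandas.read_csv` infers gzip compression from the file extension by default.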


**output_3**: fastENLOC results

**The example output below was provided by Tosin**:

Information on the fastENLOC outputs can be found [here](https://github.com/xqwen/fastenloc/blob/master/tutorial/README.md). 
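
For downstream work, the gene-level file (`*.enloc.gene.out`, shown in the example output below) can be loaded and ranked in Python. This is a minimal sketch, assuming the file keeps the whitespace-delimited `Gene`, `GRCP`, `GLCP` header seen in that output; `load_gene_out` is a hypothetical helper, and the rows are copied from the example.

```python
import io

import pandas as pd


def load_gene_out(source):
    """Read a gene-level fastENLOC file and rank genes by GLCP, largest first."""
    # A whitespace regex tolerates the double tab after the gene column.
    table = pd.read_csv(source, sep=r"\s+")
    return table.sort_values("GLCP", ascending=False).reset_index(drop=True)


# Three rows copied from the example output in this notebook.
example = io.StringIO(
    "Gene\t\tGRCP\tGLCP\n"
    "ENSG00000000457\t\t2.250e-03\t9.590e-02\n"
    "ENSG00000000971\t\t0.000e+00\t0.000e+00\n"
    "ENSG00000001630\t\t1.794e-03\t1.040e-01\n"
)
genes = load_gene_out(example)
print(genes)
```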

In [6]:
head -n 100 /restricted/projectnb/casa/oaolayin/fastenloc_test/kunkle_DLPFC*

==> /restricted/projectnb/casa/oaolayin/fastenloc_test/kunkle_DLPFC.enloc.enrich.out <==
                Intercept    -7.802           -
Enrichment (no shrinkage)     1.853     3696.712
Enrichment (w/ shrinkage)     0.000       1.000


## Alternative (coloc) parameterization: p1 = 4.088e-04, p2 = 1.476e-06, p12 = 6.038e-10


==> /restricted/projectnb/casa/oaolayin/fastenloc_test/kunkle_DLPFC.enloc.gene.out <==
Gene		GRCP	GLCP
ENSG00000000457		2.250e-03	9.590e-02
ENSG00000000971		0.000e+00	0.000e+00
ENSG00000001084		5.420e-05	6.338e-04
ENSG00000001167		4.646e-04	1.021e-02
ENSG00000001460		0.000e+00	0.000e+00
ENSG00000001461		0.000e+00	0.000e+00
ENSG00000001561		2.034e-04	2.624e-03
ENSG00000001626		0.000e+00	0.000e+00
ENSG00000001629		1.429e-03	5.209e-02
ENSG00000001630		1.794e-03	1.040e-01
ENSG00000002016		0.000e+00	0.000e+00
ENSG00000002745		5.529e-04	1.674e-03
ENSG00000002822		2.165e-03	2.726e-02
ENSG00000002834		5.972e-04	5.972e-04
ENSG00000002933		0.000e+00	0.000e+00

==> /restricte