# Fine-mapping with SuSiE RSS model

This notebook takes a list of LD reference files and a list of sumstat files from various association studies ...

## Input

1. **FIXME we need to make input as a bed file with chrom, start and end** A tab-delimited table giving, for each region, the path where its LD matrix is stored. It can be generated using the `ld_per_region_plink` step of the genotype processing module.

```
#id     dir
chr17_60570445_65149278 /mnt/vast/hpc/csg/molecular_phenotype_calling/LD/output_npz_2/1300_hg38_EUR_LD_blocks_npz_files/ROSMAP_NIA_WGS.leftnorm.filtered.filtered.chr17_60570445_65149278.flt16.npz
```

2. A tab-delimited table giving, for each chromosome, the path where its summary statistics are stored. It can be generated using the `yml_generator` module before the QCed sumstats are generated. **FIXME: If the chrom name is zero that means the data is genome-wide**
```
hs3163@csglogin:/mnt/vast/hpc/csg/xqtl_workflow_testing/susie_rss$ cat /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/qced_sumstat_list.txt
#chr    ADGWAS_Bellenguez_2022
1       /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/ADGWAS_Bellenguez_2022.1/ADGWAS2022.chr1.sumstat.tsv
2       /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/ADGWAS_Bellenguez_2022.2/ADGWAS2022.chr2.sumstat.tsv
3       /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/ADGWAS_Bellenguez_2022.3/ADGWAS2022.chr3.sumstat.tsv
4       /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/ADGWAS_Bellenguez_2022.4/ADGWAS2022.chr4.sumstat.tsv
5       /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/ADGWAS_Bellenguez_2022.5/ADGWAS2022.chr5.sumstat.tsv
```

3. Regions we want to analyze, in the format `chr:start-end`. Multiple regions can be given. If none are specified, the regions in the LD data list are used.
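
Both the LD block ids (`chr17_60570445_65149278`) and the region strings (`chr:start-end`) encode a chromosome and an interval. The following is a minimal sketch of turning either into a BED-like `(chrom, start, end)` triple. The helper names `parse_ld_id` and `parse_region` are hypothetical and not part of the pipeline, and the sketch assumes chromosome names contain no underscore.

```python
import re

def parse_ld_id(ld_id):
    """Split an LD block id such as 'chr17_60570445_65149278' into (chrom, start, end)."""
    # Assumption: the id has exactly three underscore-separated fields.
    chrom, start, end = ld_id.split("_")
    return chrom, int(start), int(end)

def parse_region(region):
    """Split a region string such as 'chr17:60570445-65149278' into (chrom, start, end)."""
    m = re.fullmatch(r"(\w+):(\d+)-(\d+)", region)
    if m is None:
        raise ValueError(f"Region is not in chr:start-end format: {region!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))

print(parse_ld_id("chr17_60570445_65149278"))
print(parse_region("chr17:60570445-65149278"))
```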

## Output

1. An RDS file containing the output SuSiE object, the names of all variants that went through the analysis, the z-scores, and the LD used for the analysis.
2. A sumstat file with an additional column containing the SLALOM results.

## MWE

In [None]:
sos run pipeline/SuSiE_RSS.ipynb SuSiE_RSS \
    --ld-data test.ld.list \
    --sumstats /mnt/vast/hpc/csg/xqtl_workflow_testing/ADGWAS/data_intergration/ADGWAS2022/qced_sumstat_list.txt \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest --impute --cwd output_impute_2

In [None]:
[global]
parameter: cwd = path("output")
# getting the overlapped input
parameter: ld_data = path
parameter: sumstats = paths
import pandas as pd
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 3

parameter: lead_idx_choice = "pvalue"
parameter: abf_prior_variance = 0.4
parameter: nlog10p_dentist_s_threshold = 4
parameter: r2_threshold = 0.6
parameter: n = 0
parameter: max_iter = 1000
parameter: impute = True # Whether to impute the sumstat for all the snp in LD but not in sumstat.
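
The `entrypoint` expression above derives a micromamba environment name from the container image name. Below is a standalone sketch of the same logic, useful for checking what a given `--container` value resolves to. The function name `derive_entrypoint` is hypothetical; the regular expression is copied from the `[global]` section.

```python
import re

def derive_entrypoint(container):
    """Mirror the [global] entrypoint expression for a container string."""
    if not container:
        return ""
    # Keep only the image name, e.g. 'pecotmr_apptainer:latest'
    image = container.split("/")[-1]
    # Drop the known suffixes to recover the environment name
    env = re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', image)
    return 'micromamba run -a "" -n ' + env

print(derive_entrypoint("oras://ghcr.io/cumc/pecotmr_apptainer:latest"))
```

An image name that ends in none of the three suffixes is passed through unchanged as the environment name.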

In [None]:
[get_analysis_regions: shared = "regional_data"]
# This will pair the LD matrix blocks with each of the input summary stats

# `ld_data` is the parameter declared in [global]; `LD_list` does not exist before this line
LD_list = pd.read_csv(ld_data, sep="\t")
sumstat_list = pd.read_csv(sumstats,sep="\t")
LD_list["#chr"] = [x[0].replace("chr", "") for x in  LD_list["#id"].str.split("_") ]
sumstat_list["#chr"] = [str(x).replace("chr", "") for x in  sumstat_list["#chr"] ]
input_inv = LD_list.merge(sumstat_list)
input_list = input_inv.iloc[:,[1,3]].values.tolist()
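
To illustrate the pairing done in this step, here is a small self-contained sketch with toy tables. The file paths and the study name `study_A` are made up for illustration. The sketch selects columns by name rather than by position, which is an alternative to the `iloc` call above and not what the step itself does.

```python
import io
import pandas as pd

# Toy stand-ins for the LD list and the sumstat list (paths are hypothetical)
ld_text = (
    "#id\tdir\n"
    "chr17_60570445_65149278\t/path/to/ld/chr17_block.npz\n"
    "chr2_100_200\t/path/to/ld/chr2_block.npz\n"
)
sumstat_text = (
    "#chr\tstudy_A\n"
    "2\t/path/to/study_A.chr2.tsv\n"
    "17\t/path/to/study_A.chr17.tsv\n"
)

ld_list = pd.read_csv(io.StringIO(ld_text), sep="\t")
sumstat_list = pd.read_csv(io.StringIO(sumstat_text), sep="\t")

# Derive a chromosome key without the 'chr' prefix on both sides
ld_list["#chr"] = [x[0].replace("chr", "") for x in ld_list["#id"].str.split("_")]
sumstat_list["#chr"] = [str(x).replace("chr", "") for x in sumstat_list["#chr"]]

# Inner join on the chromosome key: one row per (LD block, sumstat file) pair
paired = ld_list.merge(sumstat_list, on="#chr")
pairs = paired[["dir", "study_A"]].values.tolist()
print(pairs)
```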

In [None]:
[SuSiE_RSS_1]
parameter: L = 10
parameter: max_L = 1000

depends: sos_variable("regional_data")

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = 2, group_with = "meta_info"
# name = f'{_input[0]:b}'.split(".")[-3]
output: f'{cwd:a}/{_input[1]:bn}.{name}.unisusie_rss.fit.rds',
        f'{cwd:a}/{_input[1]:bn}.{name}.unisusie_rss.ss_qced.tsv'    
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand = '${ }', stdout = f"{_output[0]:nn}.stdout", stderr = f"{_output[0]:nn}.stderr", container = container, entrypoint = entrypoint
  
    ## Step 1: Load summary stats and LD data for a region, and match them, using the function in pecotmr::LD.R

    ## Step 2: basic QC between LD and summary stats --- to correct allele flipping mainly in pecotmr 
  
    ## Step 3: Perform SuSiE RSS with QC using my prototype
  
    ## Output are 1) RDS file of fine-mapping results and 2) summary stats file for the region after allele flipping QC as well as the SuSiE RSS based QC
    ## For fine-mapping results we would like to report both the top variant model (LD  reference free) and the conventional fine-mapping results
  
    ## After that we repeat Step 1 and Step 3 with RSS QC (susie_rss as is). 
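
The allele flipping correction in Step 2 can be sketched as follows. This is only an illustration of the idea, not the pecotmr implementation: the function name `harmonize_z` and the data layout are assumptions, and strand flips are not handled.

```python
def harmonize_z(ld_variants, sumstats):
    """Align summary-statistic z-scores to the allele coding of the LD reference.

    ld_variants: list of (chrom, pos, ref, alt) tuples in LD matrix order.
    sumstats: dict mapping (chrom, pos) to (ref, alt, z).
    Returns a list of ((chrom, pos, ref, alt), z) for variants found in both.
    """
    matched = []
    for chrom, pos, ref, alt in ld_variants:
        record = sumstats.get((chrom, pos))
        if record is None:
            continue  # variant absent from the summary statistics
        s_ref, s_alt, z = record
        if (s_ref, s_alt) == (ref, alt):
            matched.append(((chrom, pos, ref, alt), z))
        elif (s_ref, s_alt) == (alt, ref):
            # Alleles are swapped relative to the LD reference: flip the sign
            matched.append(((chrom, pos, ref, alt), -z))
        # Any other allele combination is inconsistent and is dropped
    return matched

ld = [("17", 100, "A", "G"), ("17", 200, "C", "T"), ("17", 300, "G", "A")]
ss = {("17", 100): ("A", "G", 2.5), ("17", 200): ("T", "C", 1.0), ("17", 300): ("G", "C", 4.0)}
print(harmonize_z(ld, ss))
```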

In [None]:
[SuSiE_RSS_2]
# SuSiE_RSS_1 emits two files per region; the first one is the RDS fit used here
output: pip_plot = f"{cwd}/{_input[0]:bn}.png"
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '20G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: container=container, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint = entrypoint
    res = readRDS(${_input[0]:r})
    png(${_output[0]:r}, width = 14, height=6, units='in', res=300)
    par(mfrow=c(1,2))
    susieR::susie_plot(res, y= "PIP", pos=list(attr='pos',start=res$pos[1],end=res$pos[length(res$pos)]), add_legend=T, xlab="position")
    susieR::susie_plot(res, y= "z", pos=list(attr='pos',start=res$pos[1],end=res$pos[length(res$pos)]), add_legend=T, xlab="position", ylab="-log10(p)")
    dev.off()
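
The second panel is labelled `-log10(p)` although it is requested with `y = "z"`, because the p-values are derived from the z-scores. A minimal sketch of that conversion is shown below, assuming the usual two-sided normal p-value; whether `susieR::susie_plot` uses exactly this convention should be checked against the susieR documentation. The function name `neg_log10_p` is hypothetical.

```python
import math

def neg_log10_p(z):
    """Return -log10 of the two-sided normal p-value for a z-score."""
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return -math.log10(p)

for z in (0.0, 1.959964, -5.0):
    print(z, round(neg_log10_p(z), 3))
```

For very large `|z|` the p-value underflows to zero in double precision and `math.log10` raises an error, so a production version would work on the log scale directly.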

In [None]:
[SuSiE_RSS_3]
# PIP threshold used in the report below; it is referenced as ${pip_cutoff} but was never declared.
# The default of 0.1 is an assumption and should be adjusted as needed.
parameter: pip_cutoff = 0.1
input: group_by = 'all'
output: analysis_summary = f'{cwd}/{sumstats_path:bnn}.analysis_summary.md', variants_csv = f'{cwd}/{sumstats_path:bnn}.variants.csv'
R: container=container, expand = "${ }", entrypoint = entrypoint
    # Define the theme string
    theme <- '---
    theme: base-theme
    style: |
     p {
       font-size: 24px;
       height: 900px;
       margin-top:1cm;
      }
      img {
        height: 70%;
        display: block;
        margin-left: auto;
        margin-right: auto;
      }
      body {
       margin-top: auto;
       margin-bottom: auto;
       font-family: verdana;
      }
    ---    
    '
    text <- ""
    sep <- '\n\n---\n'

    inp <- strsplit("${_input:r}", " ")[[1]]
    inp <- sapply(inp, function(x) paste(head(strsplit(x, "\\.")[[1]], -1), collapse = "."))

    r <- unique(strsplit("${_input:bn}", " ")[[1]])

    num_csets <- numeric()
    region_info <- character()

    variant_info <- list()

    for (reg_i in seq_along(unique(inp))) {

      rid <- unlist(strsplit(r[reg_i], '\\.'))[1]

      text_temp <- ""
      text_temp <- paste0(text_temp, "#\n\n SuSiE RSS ", r[reg_i], " \n")
      text_temp <- paste0(text_temp, "![](", r[reg_i], ".png)", sep, " \n \n")

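      # strip the leading quote left by the :r formatter, then restore the .rds extension
      each <- unique(inp)[reg_i]
      rd <- readRDS(paste0(substr(each, 2, nchar(each)), ".rds"))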

      # find the number of cs in the current region
      if (is.null(rd$sets$cs)) {
        num_csets <- c(num_csets, 0)
      } else {
        num_csets <- c(num_csets, length(rd$sets$cs))
      }
      cat(num_csets, "\n")

      # this will store the indices of all variants that cross the threshold
      ind_p <- which(rd$pip >= ${pip_cutoff})
      sumvars <- 0

      # if we have at least one cs in the current region
      if (num_csets[reg_i] > 0) {
        tbl_header <- "| chr number | pos at highest pip | ref | alt | region id | cs | highest pip |  \n| --- | --- | --- | --- | --- | --- | --- |  \n"

        table <- ""

        sumpips <- 0

        for (cset in names(rd$sets$cs)) {
          print(cset)

          # if we have many variants in the cs
          if (length(rd$sets$cs[[cset]]) > 1) {
            highestpip <- max(rd$pip[rd$sets$cs[[cset]]])
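            # which.max() indexes within the credible set, so map it back to the variant index
            poswhighestpip <- rd$sets$cs[[cset]][which.max(rd$pip[rd$sets$cs[[cset]]])]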

            # we make sure that ind_p only stores the variants that aren't in any cs
            ind_p <- setdiff(ind_p, rd$sets$cs[[cset]])

            # append variant info
            i <- poswhighestpip
            variant_info[[length(variant_info) + 1]] <- list(rd$chr[i], rd$pos[i], rd$ref[i], rd$alt[i], rid, cset, rd$pip[i])

            table <- paste0(table, "| ", rd$chr[i], " | ", rd$pos[i], " | ", rd$ref[i], " | ", rd$alt[i], " | ", rid, " | ", cset, " | ", sprintf("%.2f", rd$pip[i]), " |  \n")

            sumpips <- sumpips + sum(rd$pip[rd$sets$cs[[cset]]])
            sumvars <- sumvars + length(rd$sets$cs[[cset]])
          } else { # if we have only one variant in the cs
            i <- rd$sets$cs[[cset]]

            # we make sure that ind_p only stores the variants that aren't in any cs
            ind_p <- setdiff(ind_p, i)

            # append variant info
            variant_info[[length(variant_info) + 1]] <- list(rd$chr[i], rd$pos[i], rd$ref[i], rd$alt[i], rid, cset, rd$pip[i])

            table <- paste0(table, "| ", rd$chr[i], " | ", rd$pos[i], " | ", rd$ref[i], " | ", rd$alt[i], " | ", rid, " | ", cset, " | ", sprintf("%.2f", rd$pip[i]), " |  \n")

            sumpips <- sumpips + rd$pip[i]
            sumvars <- sumvars + 1
          }
        }

        text_temp <- paste0(text_temp, "- Total number of variants: ", length(rd$pip), "\n")
        text_temp <- paste0(text_temp, "- Expected number of causal variants: ", sprintf("%.2f", sumpips), "\n")
        text_temp <- paste0(text_temp, "- Number of variants with PIP > ", ${pip_cutoff}, " and not in any CS: ", length(ind_p), "\n\n")
        text_temp <- paste0(text_temp, tbl_header, table, sep)

        if (num_csets[reg_i] > 1) {
          text_temp <- paste0(text_temp, "#### CORR: Correlation between CS | OLAP: Overlap between CS\n")

          cs <- names(rd$sets$cs)

          corrheader <- "|  |"
          corrbreak <- "| --- |"

          for (i in cs) {
            corrheader <- paste0(corrheader, " CORR ", i, " |")
            corrbreak <- paste0(corrbreak, " --- |")
          }

          corrheader <- paste0(corrheader, "  |")
          corrbreak <- paste0(corrbreak, " --- |")

          for (i in cs) {
            corrheader <- paste0(corrheader, " OLAP ", i, " |")
            corrbreak <- paste0(corrbreak, " --- |")
          }

          corrheader <- paste0(corrheader, "\n")
          corrbreak <- paste0(corrbreak, "\n")

          body <- ""

          for (en in seq_along(cs)) {
            i <- cs[en]
            body <- paste0(body, "| ", i, " |")
            for (j in rd$cscorr[[en]]) {
              body <- paste0(body, " ", sprintf("%.2f", j), " |")
            }
            body <- paste0(body, "  |")
            for (j in names(rd$sets$cs)) {
              body <- paste0(body, " ", length(intersect(rd$sets$cs[[i]], rd$sets$cs[[j]])), " |")
            }
            body <- paste0(body, "\n")
          }

          text_temp <- paste0(text_temp, corrheader, corrbreak, body, sep)
        }

        region_info <- c(region_info, text_temp)
      }
    }

    f <- file(${_output["analysis_summary"]:r}, "w")
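    # `text` is never filled in; the per-region slides are accumulated in `region_info`
    writeLines(paste0(theme, paste(region_info, collapse = "")), f)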
    close(f)

    for (i in ind_p) {
      # append variant info
      variant_info[[length(variant_info) + 1]] <- list(rd$chr[i], rd$pos[i], rd$ref[i], rd$alt[i], rid, "None", rd$pip[i])
    }

    df <- do.call(rbind, variant_info)
    colnames(df) <- c("chr", "pos", "ref", "alt", "rid", "cs", "pip")
    write.table(df, ${_output["variants_csv"]:r}, sep = "\t", row.names = TRUE, col.names = TRUE)
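
The core bookkeeping of this step (the top-PIP variant per credible set, plus variants above the PIP cutoff that belong to no credible set) can be sketched independently of R. The function name `summarize_credible_sets` and the 0-based index convention are assumptions made for this illustration; the R code above uses 1-based indices from the SuSiE object.

```python
def summarize_credible_sets(pip, credible_sets, pip_cutoff):
    """Summarize fine-mapping output.

    pip: list of posterior inclusion probabilities, one per variant.
    credible_sets: dict mapping a CS name to a list of 0-based variant indices.
    Returns (top, outside) where top maps each CS name to (index, pip) of its
    highest-PIP variant, and outside lists indices with pip >= pip_cutoff
    that are not in any credible set.
    """
    top = {}
    in_cs = set()
    for name, members in credible_sets.items():
        best = max(members, key=lambda idx: pip[idx])
        top[name] = (best, pip[best])
        in_cs.update(members)
    outside = [i for i, p in enumerate(pip) if p >= pip_cutoff and i not in in_cs]
    return top, outside

pip = [0.01, 0.6, 0.3, 0.2, 0.05]
top, outside = summarize_credible_sets(pip, {"L1": [1, 2]}, pip_cutoff=0.1)
print(top, outside)
```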

In [None]:
# Generate analysis report: HTML file, and optionally PPTX file
[SuSiE_RSS_4]
output: f"{_input['analysis_summary']:n}.html"
sh: container=container_marp, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint = entrypoint
    node /opt/marp/.cli/marp-cli.js ${_input['analysis_summary']} -o ${_output:a} \
        --title '${region_file:bnn} fine mapping analysis' \
        --allow-local-files
    node /opt/marp/.cli/marp-cli.js ${_input['analysis_summary']} -o ${_output:an}.pptx \
        --title '${region_file:bnn} fine mapping analysis' \
        --allow-local-files