# Fine-mapping result post-processing

This pipeline consolidates results from various fine-mapping tools into a uniform format, adds rsIDs as necessary, and performs a simple "liftover" via rsID (not the formal UCSC `liftOver`) to generate output in the HG37 and HG38 builds.

This pipeline was devised by Gao Wang and implemented by Gao Wang and Kushal Dey at Harvard University.

## Input data

The inputs are the results of the fine-mapping pipeline `summary_statistics_finemapping.ipynb`, in R's `RDS` format for SuSiE and CAVIAR and in `pkl` format for DAP.

## Output data

The output columns are:

```
chr pos ref alt snp_id locus_id PIP CS
```

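where `snp_id` is the rsID if rsID annotation files are provided; otherwise it takes the form `chr:pos:ref:alt`. In the files written by the workflow code below, this column is named `variant_id`, and `PIP` and `CS` appear in lower case.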
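
As an illustration of this layout, the following Python sketch splits a `chr:pos:ref:alt` identifier into its fields and assembles one tab-separated output row. It is not part of the workflow; the helper names `parse_variant_id` and `format_row` and the example values are hypothetical, and numeric chromosome labels are assumed.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: int
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split a `chr:pos:ref:alt` identifier into its four fields."""
    chrom, pos, ref, alt = variant_id.split(":")
    # Assumption: chromosome labels are numeric (autosomes only)
    return Variant(int(chrom), int(pos), ref, alt)


def format_row(variant_id, locus_id, pip, cs, round_off=6):
    """Build one tab-separated row: chr pos ref alt id locus_id PIP CS."""
    v = parse_variant_id(variant_id)
    fields = [v.chrom, v.pos, v.ref, v.alt, variant_id, locus_id,
              round(pip, round_off), cs]
    return "\t".join(str(x) for x in fields)


# Made-up values, for illustration only
row = format_row("1:12345:A:G", "locus_1", 0.1234567891, 1)
print(row)
```

Here the identifier column falls back to the `chr:pos:ref:alt` string because no rsID annotation is involved.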

## Additional data processing

We also provide additional routines to process the data:

1. Swap the `snp_id` column using external annotations, for example by rsID.
2. "Lift over" to other builds. Only rsID matching is supported.

Note that only the first 5 columns are required for these additional operations. The columns after the fifth can be arbitrary and are kept unchanged during processing.

- To trigger optional step 1, the parameters `--id-map-prefix` and `--id-map-suffix` must both be valid.
- To trigger optional step 2, the parameters `--build-map-prefix` and `--build-map-suffix` must both be valid.
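
The rsID-based "liftover" of step 2 can be pictured as a lookup from rsID to coordinates in the target build. The sketch below is a minimal Python illustration under stated assumptions, not the workflow's implementation: the function `liftover_by_rsid`, the dictionary-based build map, and all coordinates are hypothetical, and returning unmatched variants separately is a choice made here rather than documented behavior.

```python
def liftover_by_rsid(rows, target_build_map):
    """Replace chr/pos using an rsID -> (chr, pos) map for the target build.

    rows: list of dicts holding at least chr, pos, ref, alt, snp_id;
          any further keys are carried over untouched.
    Returns (lifted, unmatched).
    """
    lifted, unmatched = [], []
    for row in rows:
        hit = target_build_map.get(row["snp_id"])
        if hit is None:
            # No rsID match: coordinates in the other build are unknown
            unmatched.append(row)
            continue
        new_row = dict(row)  # copy so extra columns are preserved as-is
        new_row["chr"], new_row["pos"] = hit
        lifted.append(new_row)
    return lifted, unmatched


# Made-up rsIDs and coordinates, for illustration only
rows = [
    {"chr": 1, "pos": 1000, "ref": "A", "alt": "G", "snp_id": "rs0001", "PIP": 0.8},
    {"chr": 1, "pos": 2000, "ref": "C", "alt": "T", "snp_id": "1:2000:C:T", "PIP": 0.1},
]
target_build_map = {"rs0001": (1, 1500)}
lifted, unmatched = liftover_by_rsid(rows, target_build_map)
print(len(lifted), len(unmatched))
```

A variant whose identifier is still in `chr:pos:ref:alt` form cannot be matched this way, which is why the ID update of step 1 normally comes first.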

In [1]:
%cd ~/GIT/github/fine-mapping

/home/gaow/Documents/GIT/github/fine-mapping

## The workflow

In [2]:
sos run workflow/finemapping_results_wrangler.ipynb -h

usage: sos run workflow/finemapping_results_wrangler.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  default

Global Workflow Options:
  --ss-data-prefix . (as path)
                        summary statistics file prefix which is the path to all
                        output files
  --pattern 'uniform.SuSiE_B.L_5.prior_0p005.res_var_false'
                        identifier for fine-mapping results to be extracted

Sections
  default_1:
    Workflow Options:
      --pip-thresh 0.05 (as float)
                        Keep PIP above these thresholds
      --round-off 6 (as int)
                        Round PIP to given digits


In [None]:
[global]
# summary statistics file prefix
# which is the path to all output files
parameter: ss_data_prefix = path()
# identifier for fine-mapping results to be extracted
parameter: pattern = "uniform.SuSiE_B.L_5.prior_0p005.res_var_false"

if 'SuSiE' in pattern:
    source = 'susie'
elif 'CAVIAR' in pattern:
    source = 'CAVIAR'
elif 'DAP' in pattern:
    source = 'DAP'
else:
    raise ValueError("Invalid --pattern specification")

In [None]:
# Consolidate fine-mapping results
[default_1 (extract results to single file)]
depends: R_library('dscrutils') # can be installed via `devtools::install_github("stephenslab/dsc",subdir = "dscrutils", force = TRUE)`
# Keep PIP above these thresholds
parameter: pip_thresh = 0.05
# Round PIP to given digits
parameter: round_off = 6
input: [glob.glob(f'{ss_data_prefix:a}/*/*.{pattern}.{ext}') for ext in ['rds', 'pkl']]
output: f"{ss_data_prefix:a}.{pattern}.gz"
fail_if(len(_input) == 0, msg = f'Cannot find valid input files by pattern {ss_data_prefix:a}/*/*.{pattern}.[rds,pkl]')
R: expand = "${ }", workdir = ss_data_prefix
    # Here we define get_*_output functions for different output format
    get_susie_output = function(unit, rds_file) {
        cs_id = rep(0, length(rds_file$var_names))
        num_cs = length(rds_file$sets$cs)
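        # seq_len() gives an empty loop when no credible set is reported (1:0 would iterate over 1 and 0)
        for(id in seq_len(num_cs)){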
            cs_id[rds_file$sets$cs[[id]]] = id
        }
        cbind.data.frame(rep(unit, length(rds_file$var_names)), 
                              rds_file$var_names, 
                              rds_file$pip, cs_id)
    }
    # Data extraction script
    library(data.table)
    files = c(${paths([x.relative_to(ss_data_prefix) for x in _input]):r,})
    processed_dat = c()
    for (f in files) {
      rds_file = dscrutils::read_dsc(f)
      unit = dirname(f)
      processed_dat_temp = get_${source}_output(unit, rds_file)
      colnames(processed_dat_temp) = c("locus_id", "variant_id", "pip", "cs")
      processed_dat = rbind(processed_dat, processed_dat_temp[which(processed_dat_temp[,3] >= ${pip_thresh} | processed_dat_temp[,4] > 0), ])
      cat("We are at unit", unit, "\n")
    }

    extract_chr = sapply(processed_dat[,2], function(x) return(as.numeric(strsplit(as.character(x), ":")[[1]][1])))
    extract_pos = sapply(processed_dat[,2], function(x) return(as.numeric(strsplit(as.character(x), ":")[[1]][2])))
    extract_ref = sapply(processed_dat[,2], function(x) return(as.character(strsplit(as.character(x), ":")[[1]][3])))
    extract_alt = sapply(processed_dat[,2], function(x) return(as.character(strsplit(as.character(x), ":")[[1]][4])))
    variant_id = processed_dat[,2]
    locus_id = processed_dat[,1]
    pip = round(processed_dat[,3], ${round_off})
    cs = processed_dat[,4]
    df = data.frame("chr" = extract_chr, "pos" = extract_pos, "ref" = extract_ref, "alt" = extract_alt, 
                    "variant_id" = variant_id, "locus_id" = locus_id, "pip" = pip, "cs" = cs)
    df_sorted = df[order(df[,1], df[,2]),]
    write.table(df_sorted, gzfile(${_output:r}), sep = "\t", quote=FALSE, row.names=FALSE)
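
The two rules used in the extraction step above, labelling each variant with its credible set and keeping a variant if its PIP reaches the threshold or it belongs to any credible set, can be restated in a few lines of Python. This is only a sketch for illustration: the function names are hypothetical, the values are made up, and indices are 0-based here whereas the R code uses 1-based indices.

```python
def assign_cs(n_variants, credible_sets):
    """Return one credible-set label per variant; 0 means 'in no set'.

    credible_sets: list of lists of 0-based variant indices.
    """
    cs_id = [0] * n_variants
    for label, members in enumerate(credible_sets, start=1):
        for idx in members:
            cs_id[idx] = label
    return cs_id


def keep_variant(pip, cs, pip_thresh=0.05):
    """Keep a variant if PIP >= threshold or it is in a credible set."""
    return pip >= pip_thresh or cs > 0


# Made-up PIPs and credible sets, for illustration only
pips = [0.9, 0.01, 0.02, 0.2, 0.001]
cs = assign_cs(len(pips), [[0, 2], [4]])
kept = [i for i, (p, c) in enumerate(zip(pips, cs)) if keep_variant(p, c)]
print(cs, kept)
```

With an empty list of credible sets, `assign_cs` returns all zeros, so only the PIP threshold decides which variants are kept.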

In [None]:
# Optional step: update variant ID
[default_2 (update variant ID from per chrom files)]
# Path containing files for variant ID update rule
# Each file covers a separate chromosome
parameter: id_map_prefix = path()
parameter: id_map_suffix = '.bim'
# columns: the first element is the position column, the second is the variant ID column
# use [4,2] for BIM files
parameter: columns = [4,2]
# chromosome identifiers
parameter: chroms = [x+1 for x in range(22)]
stop_if(len(glob.glob(f"{id_map_prefix:a}/*{id_map_suffix}")) == 0, msg = 'Variant IDs are not updated because no valid file is found using --id-map-prefix and --id-map-suffix')
output: f'{_input:n}.var_id_updated.gz'
R: expand = "${ }"
    library(dplyr)
    library(data.table)
    out =  data.frame(fread("zcat ${_input}"))
    out %>% mutate_if(is.factor, as.character) -> out
    chroms = c(${paths(chroms):r,})
    for(numchr in chroms){
        which_chr = which(out$chr == numchr)
        out_sub = out[which_chr, ]
        dbfile = data.frame(fread(paste0("${id_map_prefix}", numchr, "${id_map_suffix}")))
        out_sub_new = out_sub
        idx1 = match(out_sub$pos, dbfile$V${columns[0]})
        idx2 = idx1[which(!is.na(idx1))]
        idx3 = 1:length(out_sub$pos)
        idx4 = idx3[which(!is.na(idx1))]
        out_sub_new[idx4, "variant_id"] = dbfile[idx2,"V${columns[1]}"]
        out[which_chr, ] = out_sub_new
        cat("Replaced IDs for chr", numchr, "\n")
    }
    write.table(out, gzfile(${_output:r}), sep = "\t", quote=FALSE, row.names=FALSE)
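
The ID update above matches variants to the annotation by position within each chromosome and leaves unmatched variants untouched. The Python sketch below mirrors that logic on toy data. It is an illustration under stated assumptions rather than a port of the step: the helper names are hypothetical, the BIM lines are made up, and, like `match()` in R, the first annotation entry at a given position wins.

```python
def build_id_map(bim_lines):
    """Map chromosome -> position -> variant ID from BIM-style lines.

    BIM columns: chromosome, variant ID, genetic distance, position,
    allele 1, allele 2 (so position is column 4 and ID is column 2).
    """
    id_map = {}
    for line in bim_lines:
        fields = line.split()
        chrom, var_id, pos = int(fields[0]), fields[1], int(fields[3])
        # setdefault keeps the first ID seen at a position
        id_map.setdefault(chrom, {}).setdefault(pos, var_id)
    return id_map


def update_variant_ids(rows, id_map):
    """Replace variant_id where chr/pos has an annotation; keep others."""
    updated = []
    for row in rows:
        new_row = dict(row)
        new_id = id_map.get(row["chr"], {}).get(row["pos"])
        if new_id is not None:
            new_row["variant_id"] = new_id
        updated.append(new_row)
    return updated


# Made-up annotation and results, for illustration only
bim_lines = [
    "1 rs0001 0 12345 A G",
    "1 rs0002 0 12345 A T",
    "2 rs0003 0 500 C T",
]
rows = [
    {"chr": 1, "pos": 12345, "variant_id": "1:12345:A:G"},
    {"chr": 1, "pos": 999, "variant_id": "1:999:C:T"},
]
updated = update_variant_ids(rows, build_id_map(bim_lines))
print([r["variant_id"] for r in updated])
```

Because only the position is compared, alleles are not checked; two variants sharing a position receive the same ID.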

## Example run

In [None]:
sos run workflow/finemapping_results_wrangler.ipynb \
    --ss-data-prefix ~/tmp/01-Jan-2019 \
    --pattern uniform.SuSiE_B.L_5.prior_0p005.res_var_false
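
Once the run finishes, the consolidated table can be inspected outside the workflow. The sketch below shows one way to do so in Python, assuming the output name follows the `{ss_data_prefix:a}.{pattern}.gz` rule of the first step; `expected_output` and `read_results` are hypothetical helpers, and the demonstration reads a toy file with made-up values rather than real results.

```python
import csv
import gzip
import os
import tempfile


def expected_output(ss_data_prefix, pattern):
    """Guess the output path (assumption: absolute prefix + pattern + .gz)."""
    prefix = os.path.abspath(os.path.expanduser(ss_data_prefix))
    return f"{prefix}.{pattern}.gz"


def read_results(path):
    """Read a gzipped, tab-separated result table into a list of dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Toy file standing in for real workflow output
with tempfile.TemporaryDirectory() as tmp:
    path = expected_output(os.path.join(tmp, "01-Jan-2019"), "toy_pattern")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("chr\tpos\tref\talt\tvariant_id\tlocus_id\tpip\tcs\n")
        handle.write("1\t12345\tA\tG\t1:12345:A:G\tlocus_1\t0.123457\t1\n")
    records = read_results(path)

print(records[0]["variant_id"], records[0]["pip"])
```

All values come back as strings; convert `pos`, `pip`, and `cs` to numbers before filtering or sorting.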